Abstract

We provide new methods for the estimation of the one-point specification probabilities in
general discrete random fields. Our procedures are based on model selection by minimization
of a penalized empirical criterion. The selected estimators satisfy sharp oracle inequalities,
without any assumption on the random field, for both the $L_2$-risk and the Kullback loss. We also
prove the validity of the slope heuristic for the specification probabilities estimation problem.
Finally, simulation studies illustrate the practical performance of our methods.

1 Introduction 



Random fields are used in a variety of domains, including computer vision [Bes93, Woo78], image
processing [CJ83], neuroscience [SBSB06], and as a general model in spatial statistics [Rip81].
The main motivation for our work comes from neuroscience, where advances in multi-channel
and optical technology have enabled scientists to study not just a single neuron at a
time, but tens to thousands of neurons simultaneously [TSMI10]. An important question in
neuroscience is to understand how the neurons in such an ensemble interact with each other and
how this is related to the animal behavior [SBSB06, BKM04]. This question turns out to be
hard for three reasons. First, the experimenter only ever has access to a small part of the
neural system, which means that the system is partially observed. Second, there is no good and
tractable model for populations of neurons, in spite of the good models available for single
neurons; therefore very general models must be considered. Finally, strong long-range multi-neuron
interactions exist [LOU+10]. Our work addresses these three difficulties, as shown below.



A random field is a triplet $(S, A, P)$ where $S$ is a discrete set of sites, $A$ is a finite alphabet
and $P$ is a probability measure on the set $\mathcal{X}(S) = A^S$ of configurations on $S$. Given a random
field $(S, A, P)$, we define the one-point specification probabilities of $P$ as regular versions of the
following conditional probabilities, for all sites $i$ in $S$ and all configurations $x$ in $\mathcal{X}(S)$,

$$P_{i|S}(x) = P\big(x(i) \mid x(j),\ j \in S/\{i\}\big).$$

The specification probabilities are important in applications as they encode conditional
independence between the sites; see for example [BM09, BMS08, CT06a, GOT10, RWL10, LT11].
The main goal of this paper is to provide good estimators of the specification probabilities.
We do not assume that the set of sites $S$ is finite. However, the set of observed sites
$V_M \subset S$ is finite, with cardinality $M$. Let $X_{1:n} = (X_1, \dots, X_n)$ be an i.i.d. sample with marginal
law $P$; the observation set is $X_{1:n}(V_M) = (X_k(j))_{k=1,\dots,n;\ j \in V_M}$. As all the results are
non-asymptotic, $M$ is allowed to grow with $n$.

This model enables us to handle the following situations that are of particular importance 
in neuroscience. 

Example 1: Dynamic estimation of the connected neurons 

$S$ is composed of $T$ copies of the set of all neurons, and $V_M$ is composed of $T$ copies of the set of
observed neurons. A configuration represents the neural activity over a period of time of length $T$.
For every neuron $i$ and time $t'$, the support of $P_{(i,t')|S}$, defined as the minimal subset $C$ of $S$ such
that $P_{(i,t')|S} = P_{(i,t')|C}$, represents the set of neurons connected to $i$ over a period of time of length $T$.
In practice, $C$ is usually not totally contained in $V_M$ and we do not know the shape of $P$. The
problem is to obtain a good approximation of $C$ while observing only the configurations on $V_M$.

Example 2: Prediction of animal behavior 
$S$ is composed of the union of $T$ copies of the set of all neurons and $T$ copies of another site
$i_0$. $V_M$ is composed of the union of $T$ copies of the set of observed neurons and $T$ copies of
$i_0$. A configuration represents the neural activity and an associated animal behavior response
$(x(i_0,t))_t$ over a period of time of length $T$. The support $C$ of $P_{(i_0,t)|S}$ then represents the set of
neurons that should be observed during a period of time shorter than $T$ around $t$ to predict the
behavior of the animal at time $t$. Again, the problem is to obtain a good approximation of $C$
knowing very little about the animal behavior and the neural system.

Our estimators are derived from a model selection procedure by minimization of a penalized
empirical criterion. This procedure selects a subset $\hat V$ with cardinality $\hat v = O(\ln n)$. Our first
result is that the empirical conditional probabilities $\hat P_{i|\hat V}$, as estimators of $P_{i|S}$, satisfy a sharp
oracle inequality (see Section 2 and Theorems 3.2 and 3.4 for details).

The second result of the paper is a proof of the slope heuristic for the estimation of specification
probabilities. The heuristic, introduced in [BM07], is a data-driven way to optimize the
constant in front of the penalty term of the selection procedure. This heuristic is very important
in practice, because the constants involved in Theorems 3.2 and 3.4 are generally pessimistic
(see for example Figure 4 in our simulation study in Section 5).

In most applications, the support of $P_{i|S}$, defined as the minimal subset $V_\star \subset S$ such
that $P_{i|V_\star} = P_{i|S}$ (see Section 2 for details), is the object of interest. This is why most
of the literature focuses on the estimation of $V_\star$; see [BM09, BMS08, CT06a, GOT10, RWL10] for
example. This approach requires in general strong assumptions on the random field, e.g., that it is an
Ising model with strong conditions on the temperature parameter [BM09, GOT10, RWL10].
In particular, [BM09, BMS08, RWL10] assumed that the set $S$ is finite and that all the sites
are observed, i.e., that $V_M = S$. When $V_M$ does not contain $V_\star$, the meaning of the estimators
in these papers is not clear. [CT06a] considered $S = \mathbb{Z}^d$ but assumed that $V_\star$ is finite. Finally,
[GOT10, LT11] worked with infinite sets of sites and without a priori bound on the number
of interacting sites, but required a two-letter alphabet $A$ and some assumptions on $P$ that
the practitioner cannot easily verify. These restrictions are severe for applications, e.g., in
neuroscience, and cast doubt on the theoretical support for applying these methods in
practice. Our model selection procedure does not suffer from these drawbacks.

We focus here on the estimation of the conditional probabilities and we develop the oracle
approach introduced in [LT11]. As we already noticed in that paper, an oracle $V$ provides a good
estimator of the support of $P_{i|S}$. In [LT11], we used the $L_\infty$-norm to measure the risk of the
estimators. We now use the $L_2$-norm and the Kullback loss, and the new results do not require
any restriction on the random field. In particular, the finite alphabet $A$ is not restricted to
two letters, $P$ does not need to be a Gibbs measure (and therefore does not need to be an Ising or
Potts model), and the size of the support $V_\star$ of $P_{i|S}$ can be infinite. To our knowledge this is the
first work with this degree of generality.

Theoretical support for the slope heuristic is currently an active area of research and it has
been proven only for very few specific models [BM07, AM09, Ler11b, Ler11a, AB10]. Here,
we prove the validity of the slope heuristic for conditional probabilities estimation for the $L_2$ and
Kullback losses. Our proof technique is novel and sheds new light on the slope heuristic.

The paper is organized as follows. Section 2 presents the framework and some notation
that we use throughout the paper. Section 3 gives the model selection procedures and the oracle
inequalities satisfied by the selected estimators. Section 4 is devoted to the slope heuristic. We
recall the heuristic and state the theorems that justify it in our problem. Section 5 illustrates
the results of the previous sections using simulation experiments and Section 6 discusses the
results, making a detailed comparison with other works on similar problems. The proofs of the
main theorems are postponed to Section 7 and the probabilistic tools used in the main proofs
are proved in Section 8.



2 Preliminaries 

Hereafter, we call random field a triplet $(S, A, P)$ constituted by a discrete set $S$ of sites, a finite
alphabet $A$ of spins, with cardinality $a$, and a probability measure $P$ on the set of configurations
$\mathcal{X}(S) = A^S$. More generally, for all subsets $V$ of $S$, let $\mathcal{X}(V) = A^V$ be the set of configurations
on $V$. For all $x$ in $\mathcal{X}(S)$ and all subsets $V$ of $S$, we write $x(V) = (x(j))_{j \in V}$. For all $i$ in $S$,
for all subsets $V$ of $S$, for all $x$ in $\mathcal{X}(S)$ and for all probability measures $Q$ on $\mathcal{X}(V \cup \{i\})$, let

$$Q_{i|V}(x) = Q\big(x(i) \mid x(V/\{i\})\big)$$

be a regular version of the conditional probability. Hereafter, we use the convention that, if $V$ is
a finite subset of $S$, $Q$ is a probability measure on $\mathcal{X}(V)$ and $x$ is a configuration in $\mathcal{X}(V \cup \{i\})$
such that $Q(x(V/\{i\})) = 0$, then $Q_{i|V}$ is the uniform law on $A$.

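For all probability measures $Q$ on $\mathcal{X}(S)$ and all real-valued functions $f$ defined on $\mathcal{X}(S)$,
we define the $L_{2,Q}$-norm of $f$ by

$$\|f\|_Q = \sqrt{\int \sum_{x(i) \in A} f^2(x)\, dQ\big(x(S/\{i\})\big)}.$$

We also define the logarithmic loss of a non-negative function $f$ defined on $\mathcal{X}(S)$ by

$$L_Q(f) = \int \ln\left(\frac{1}{f(x)}\right) dQ(x).$$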

Let $P$ be a probability measure on $\mathcal{X}(S)$ and let $X_1, \dots, X_n$ be i.i.d. with law $P$. We introduce the empirical
probability measure $\hat P$ defined on $\mathcal{X}(S)$ by

$$\hat P(x) = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_{X_k = x}.$$



For all subsets $V$ of $S$, $\hat P_{i|V}$ is an estimator of $P_{i|S}$. We define the $L_2$-risk of $\hat P_{i|V}$ by
$\|\hat P_{i|V} - P_{i|S}\|_P^2$. This risk is decomposed via the Pythagoras relation (see Proposition 8.17) as

$$\|\hat P_{i|V} - P_{i|S}\|_P^2 = \|\hat P_{i|V} - P_{i|V}\|_P^2 + \|P_{i|V} - P_{i|S}\|_P^2.$$
The random term of the risk, $\|\hat P_{i|V} - P_{i|V}\|_P^2$, is called the variance term, and the deterministic
term $\|P_{i|V} - P_{i|S}\|_P^2$ is called the bias term of the risk.
We also define the Kullback loss of the estimator $\hat P_{i|V}$ by

$$K(P_{i|S}, \hat P_{i|V}) = L_P(\hat P_{i|V}) - L_P(P_{i|S}).$$

The Kullback risk is also decomposed into a variance term and a bias term thanks to the relation

$$\begin{aligned}
K(P_{i|S}, \hat P_{i|V}) &= \big(L_P(\hat P_{i|V}) - L_P(P_{i|V})\big) + \big(L_P(P_{i|V}) - L_P(P_{i|S})\big) \\
&= \sum_{x \in \mathcal{X}(V)} P(x(V)) \ln\left(\frac{P_{i|V}(x)}{\hat P_{i|V}(x)}\right) + \int \ln\left(\frac{P_{i|S}(x)}{P_{i|V}(x)}\right) dP(x) \\
&= K(P_{i|V}, \hat P_{i|V}) + K(P_{i|S}, P_{i|V}).
\end{aligned}$$
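As an illustration of the estimator used throughout this section, the following minimal Python sketch computes the empirical conditional probability $\hat P_{i|V}$ from a sample, including the convention that an unobserved context is mapped to the uniform law on $A$. The function name `empirical_conditional` and the data layout (each configuration is a tuple indexed by integer sites) are assumptions made for the example and are not part of the paper.

```python
from collections import Counter


def empirical_conditional(samples, i, V, alphabet):
    """Return a function x -> hat P_{i|V}(x) = hat P(x(i) | x(V/{i})).

    samples  : list of configurations, each a tuple indexed by site (assumed layout)
    i        : site whose conditional law is estimated
    V        : iterable of sites used as conditioning set (i itself is ignored)
    alphabet : the finite alphabet A
    """
    context_sites = [j for j in V if j != i]
    joint = Counter()    # counts of (context, symbol at site i)
    context = Counter()  # counts of the context alone
    for x in samples:
        c = tuple(x[j] for j in context_sites)
        joint[(c, x[i])] += 1
        context[c] += 1

    def p_hat(x):
        c = tuple(x[j] for j in context_sites)
        if context[c] == 0:
            # Convention of Section 2: unseen context -> uniform law on A.
            return 1.0 / len(alphabet)
        return joint[(c, x[i])] / context[c]

    return p_hat
```

For instance, with the four configurations `[(0, 0), (0, 1), (1, 1), (1, 1)]` on two sites, the estimate of the law of site 1 given site 0 is 1/2 for the configuration `(0, 0)` and 1 for `(1, 1)`.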

Let $V_M$ be a finite subset of $S$ with cardinality $M > e$ of observed sites and let $X_{1:n}(V_M) =
(X_1(j), \dots, X_n(j))_{j \in V_M}$ be the observation set. For all $V \subset V_M$, let $v = \mathrm{Card}(V)$. Let $s > e$ be
an integer and let

$$\mathcal{V}_s = \{V \subset V_M,\ v \le s\}, \qquad N_s = \mathrm{Card}(\mathcal{V}_s).$$

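Let $\Lambda > 100$, $\delta > 1$ and let

$$\mathcal{V}_{s,\Lambda} = \left\{ V \in \mathcal{V}_s,\ \forall x \in \mathcal{X}(S),\ \hat P(x(V)) = 0 \ \text{or}\ \hat P(x(V)) \ge \Lambda \frac{\ln(2 a^s N_s \delta)}{n} \right\}. \tag{1}$$

$$\mathcal{V}^{(2)}_{s,\Lambda} = \left\{ V \in \mathcal{V}_s,\ \forall x \in \mathcal{X}(S),\ P(x(V)) = 0 \ \text{or}\ P(x(V)) \ge \Lambda \frac{\ln(2 a^s N_s \delta)}{n} \right\}. \tag{2}$$

Let $p_* > 0$, and let

$$\mathcal{V}_{s,\Lambda,p_*} = \left\{ V \in \mathcal{V}_{s,\Lambda},\ \forall x \in \mathcal{X}(S),\ \hat P_{i|V}(x) = 0 \ \text{or}\ \hat P_{i|V}(x) \ge p_* \right\}. \tag{3}$$

$$\mathcal{V}^{(2)}_{s,\Lambda,p_*} = \left\{ V \in \mathcal{V}^{(2)}_{s,\Lambda},\ \forall x \in \mathcal{X}(S),\ P_{i|V}(x) = 0 \ \text{or}\ P_{i|V}(x) \ge p_* \right\}. \tag{4}$$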

The idea behind the sets $\mathcal{V}_{s,\Lambda,p_*}$ is to restrict the collection of sets $V$ to those where the possible
configurations are sufficiently observed. This restriction is only required when we work
with the Kullback loss. The main advantage of the sets $\mathcal{V}_{s,\Lambda,p_*}$ is that the conditions can be
verified in practice. In order to illustrate why we introduced $\mathcal{V}^{(2)}_{s,\Lambda,p_*}$, let us give the following
weak Gibbs assumption.

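GA There exists $p_* > 0$ such that, for all finite subsets $V$, for all sites $i$, and for all $x$ in $\mathcal{X}(V)$,

$$P(x(V)) = 0 \quad \text{or} \quad P_{i|V}(x) \ge p_*.$$

We have (see [Mas07], Proposition 2.5, p. 20)

$$N_s = \sum_{k=0}^{s} \binom{M}{k} \le \left(\frac{eM}{s}\right)^s.$$

Hence, $\ln(a^s N_s \delta) \le s \ln(aM) + \ln \delta$. We have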



u-n V s / 



nP(x(V)) glnn-sln^ 1 ) 

' > -> +00, 



Aln(2a s N s 8) ~ A(s ln(aM) + ln(<5)) 

if (lnn) _1 s < s* = (Inp" 1 ) -1 and Aln(M5) = 0(n a ), where a < a+ = 1 — s*. In that case, for 
alln>n(p,), V S = V S (2 1 = V S (2 1^. 
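The counting argument used in this section can be checked numerically. The sketch below assumes that the bound quoted from [Mas07] is the standard inequality $\sum_{k=0}^{s} \binom{M}{k} \le (eM/s)^s$, and verifies that, for integers $s > e$, it gives $\ln N_s \le s \ln M$, which is what yields $\ln(a^s N_s \delta) \le s\ln(aM) + \ln\delta$. The function names are hypothetical.

```python
from math import comb, e, log


def log_num_models(M, s):
    """ln N_s, where N_s counts the subsets of M observed sites with at most s elements."""
    return log(sum(comb(M, k) for k in range(s + 1)))


def log_counting_bound(M, s):
    """ln of the upper bound (eM/s)^s (assumed form of the bound quoted from [Mas07])."""
    return s * log(e * M / s)


if __name__ == "__main__":
    M, s = 50, 3
    print(log_num_models(M, s), log_counting_bound(M, s), s * log(M))
```

The requirement $s > e$ is what makes the last step work: $s \ln(eM/s) \le s \ln M$ holds exactly when $e/s \le 1$.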
3 Model Selection Results
3.1 The quadratic loss 

Our first theorem is a concentration inequality for the variance term of the $L_2$ risk.

Theorem 3.1. Let $(S,A,P)$ be a random field and let $V \in \mathcal{V}_s$. Then, for all $\delta > 1$ and all
$0 < \eta < 1$, each of the following inequalities holds with probability larger than $1 - \delta^{-1}$:



Pi\V 


~Pi\V 


Pi\V 


- Pi\V 



2 6 

< - 
p a 



s a? 41n(2(J) 91n(2<5) 2 
1 +877 - + — + Y-t- 



2 6 A ,a v 41n(2<5 9 In 2<5) 2 

~ <- l + 8r/ — + ^ + 

p a V n 



ry 4 n 



(5) 
(6) 



Comments: 

• The risk of the estimator is upper bounded, with probability larger than $1 - \delta^{-1}$, by

2 
p 



Pi 



i\V ~ Pi\S 



II I II/- n n 



This control depends on the approximation properties of $V$ through the bias $\|P_{i|V} - P_{i|S}\|_P^2$
and on the complexity $a^v$ of $V$. In practice, we would like to find a model $V$ that optimizes
this bound, even though the bias term is completely unknown. This is precisely the aim
of the following result.



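Theorem 3.2. Let $(S, A, P)$ be a random field. Let $K > 1$ and let

$$\hat V = \arg\min_{V \in \mathcal{V}_s} \left\{ -\|\hat P_{i|V}\|_{\hat P}^2 + \mathrm{pen}(V) \right\}, \quad \text{where } \mathrm{pen}(V) \ge \frac{6K}{a}\,\frac{a^v}{n}.$$

Then, there exists a constant $\kappa = \kappa(a,K)$ such that for all $\delta > 1$, with probability larger than
$1 - 4\delta^{-1}$,

$$\|P_{i|S} - \hat P_{i|\hat V}\|_P^2 \le \kappa \left( \inf_{V \in \mathcal{V}_s} \left\{ \|P_{i|S} - P_{i|V}\|_P^2 + \mathrm{pen}(V) \right\} + \frac{(\ln(N_s^2 \delta))^2}{n} \right). \tag{7}$$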



Moreover, when $K > 2$, there exists a constant $\kappa = \kappa(a, K)$ such that, with probability larger
than $1 - 4\delta^{-1}$,



P i\S ~ P i\V 



< 1 + 



ln(<5) / vev, 



■„f {||P j|s -P j|v ||J, + pe „ W } +re MW 



n 



(8) 



Comments: 



• We have $\ln(N_s^2 \delta) \le 2s\ln(M) + \ln \delta$. Denoting $T_{s,M}(\delta) = 2s\ln(M) + \ln \delta$, with probability
larger than $1 - 4\delta^{-1}$,



P 



i\S ~ P i\v 



< 1 + 



ln(<5) 7 Vev s 



P; 



i\S 



Pi\v\\ P + Cl — 
up- n 



+ C 2 



We have found a model that optimizes the bound given by Theorem 3.1, up to the $s \ln(M)$
term, among all the subsets in $\mathcal{V}_s$. Note that this is only the price to pay to make the
bound of Theorem 3.1 uniform over all the subsets in $\mathcal{V}_s$.

A very interesting feature of this result in view of applications is that it holds without
restrictions on the random field $(S,A,P)$.
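The selection rule of Theorem 3.2 can be sketched in a few lines of Python. The sketch makes several assumptions that are not taken from the paper: the squared norm is read as $\|f\|_Q^2 = \int \sum_{x(i) \in A} f^2(x)\, dQ(x(S/\{i\}))$, candidate sets $V$ are context sets that do not contain $i$ (so $v$ is the number of context sites), the penalty is $c\,a^v/n$ with a free constant `c`, and the search is an exhaustive enumeration of all subsets of size at most $s$. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def squared_empirical_norm(samples, i, V, alphabet):
    """||hat P_{i|V}||^2 under the empirical measure (assumed reading of the norm).

    Equals sum over contexts c of hat P(c) * sum_a hat P(a | c)^2.
    """
    n = len(samples)
    joint, context = Counter(), Counter()
    for x in samples:
        c = tuple(x[j] for j in V)
        joint[(c, x[i])] += 1
        context[c] += 1
    total = 0.0
    for c, n_c in context.items():
        total += (n_c / n) * sum((joint[(c, a)] / n_c) ** 2 for a in alphabet)
    return total


def select_neighborhood(samples, i, candidates, alphabet, s, c):
    """Minimize -||hat P_{i|V}||^2 + c * a^v / n over context sets V with v <= s."""
    n, a = len(samples), len(alphabet)
    best, best_crit = None, float("inf")
    for v in range(s + 1):
        for V in combinations(candidates, v):
            crit = -squared_empirical_norm(samples, i, V, alphabet) + c * a ** v / n
            if crit < best_crit:
                best, best_crit = V, crit
    return best
```

In a toy sample where site 0 always equals site 1 and site 2 is independent noise, a moderate constant selects the context `(1,)`, while a very large constant selects the empty context. The exhaustive search is only practical for small $M$ and $s$, since the number of candidate sets is $N_s$.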
3.2 The Kullback Loss

Our first result is a sharp control of the variance term of the Kullback risk.

Theorem 3.3. Let $(S,A,P)$ be a random field, let $\Lambda > 100$, $\delta > 1$, $s > 0$. Let $\mathcal{V}_{s,\Lambda}$ be the
collection defined in (1). Then, with probability larger than $1 - \delta^{-1}$, for all $V$ in $\mathcal{V}_{s,\Lambda}$, for all
$\eta > 0$, we have

^^{^H^<<^y-^y 



x€X(V) 



Let $\mathcal{V}^{(2)}_{s,\Lambda}$ be the collection defined in (2). Then, with probability larger than $1 - \delta^{-1}$, for all $V$
in $\mathcal{V}^{(2)}_{s,\Lambda}$, for all $\eta > 0$, we have



in 



xex(v) 
Comments: 

• The variance part of the Kullback risk is controlled in the same way as the variance part of the $L_2$ risk. We
only have to restrict the study to the subset $\mathcal{V}_{s,\Lambda}$ of $\mathcal{V}_s$ where all the possible configurations
are sufficiently observed. This restriction is not important when $s \ll n$, and our result
also holds without restriction on the random field.

As in the previous section, we want to optimize the bound on the Kullback loss given by Theorem
3.3 among $\mathcal{V}_{s,\Lambda}$. We introduce for this purpose the following penalized estimators.

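$$\hat V = \arg\min_{V \in \mathcal{V}_{s,\Lambda,p_*}} \left\{ -\sum_{x \in \mathcal{X}(V)} \hat P(x(V)) \ln\big(\hat P_{i|V}(x)\big) + \mathrm{pen}(V) \right\}. \tag{9}$$

$$\hat V^{(2)} = \arg\min_{V \in \mathcal{V}^{(2)}_{s,\Lambda,p_*}} \left\{ -\sum_{x \in \mathcal{X}(V)} \hat P(x(V)) \ln\big(\hat P_{i|V}(x)\big) + \mathrm{pen}(V) \right\}. \tag{10}$$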

The following theorem shows the oracle properties of the selected estimator when the penalty
term is suitably chosen.

Theorem 3.4. Let $s > 0$, $\delta > 1$, $p_* > 0$, $\Lambda > 100$ and let $\mathcal{V}_{s,\Lambda,p_*}$ and $\mathcal{V}^{(2)}_{s,\Lambda,p_*}$ be the collections
defined in (3) and (4). Let $K > 1$ and let $\hat V$ and $\hat V^{(2)}$ be the penalized estimators defined in (9)
and (10) with



pen(y) > 9K- 



n 



Then, we have, for all $\eta > 0$, with probability larger than $1 - 3\delta^{-1}$,



\^K P {P^P.p)< inf {^p(P l|s ,P^)+pen(y)} + (21nn+^^) ln( ^ ) . 
Also, for all $\eta > 0$, with probability larger than $1 - 3\delta^{-1}$,

\^K P (P ils ,P 9 )< inf {^p(P, |5 ,P 4|y ) + pen(F)}+(21nn+^^)^M. 
1 + 7/ 1 *l^(2) veV () A V rj J n 

Comments: 

• We use the same kind of penalty as in the $L_2$ case. This is not surprising because the
variance parts of the risks were controlled in the same way.

• We do not optimize the bound obtained in Theorem 3.3 among all the sets in $\mathcal{V}_{s,\Lambda}$. We
have to restrict ourselves to $\mathcal{V}_{s,\Lambda,p_*}$. However, the constant $C_{a,p_*,K}$ has the form $p_*^{-1} C_{a,K}$.
Therefore, we can choose $p_* = (\ln n)^{-1}$ and optimize the result asymptotically.

• We optimize the bound among all $\mathcal{V}_s$ under the weak Gibbs assumption GA.
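The penalized log-likelihood criterion behind the estimators (9) and (10) can be illustrated as follows. The sketch computes the empirical term $-\sum_x \hat P(x(V)) \ln \hat P_{i|V}(x)$, which is the empirical conditional entropy of site $i$ given the context, and adds a penalty of the form $c\,a^v/n$. The constant `c`, the exact penalty shape, the treatment of $V$ as a context set not containing $i$, and the function names are assumptions for the example; the restriction to the collections $\mathcal{V}_{s,\Lambda,p_*}$ is not implemented.

```python
from collections import Counter
from math import log


def empirical_log_loss(samples, i, V):
    """-sum_x hat P(x(V u {i})) * ln hat P_{i|V}(x), computed from counts."""
    n = len(samples)
    joint, context = Counter(), Counter()
    for x in samples:
        c = tuple(x[j] for j in V if j != i)
        joint[(c, x[i])] += 1
        context[c] += 1
    # Only observed (context, symbol) pairs contribute, so every log is finite.
    return -sum((m / n) * log(m / context[c]) for (c, _), m in joint.items())


def penalized_log_likelihood(samples, i, V, a, c):
    """Empirical log loss plus a hypothetical penalty c * a^v / n."""
    v = len([j for j in V if j != i])
    return empirical_log_loss(samples, i, V) + c * a ** v / len(samples)
```

When site 0 is a deterministic function of site 1, the empirical term vanishes for the context `(1,)` and equals $\ln 2$ for the empty context on a balanced binary sample, so the criterion favors `(1,)` unless the penalty constant is very large.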

4 The slope heuristic 



In practice, the constants $6Ka^{-1}$ in Theorem 3.2 and $9K$ in Theorem 3.4 are a bit pessimistic.
In order to optimize these constants, Birgé and Massart [BM07] introduced the slope
heuristic. It states that there is a minimal penalty $\mathrm{pen}_{\min}$ satisfying the following properties.

SH1 When $\mathrm{pen}(V) < \mathrm{pen}_{\min}(V)$, the complexity of the selected model is as large as possible.

SH2 When $\mathrm{pen}(V)$ is slightly larger than the minimal penalty, the complexity of the selected
model is much smaller.

SH3 When $\mathrm{pen}(V)$ is equal to 2 times the minimal penalty, the risk of the selected model
is asymptotically that of an oracle.

In practice, the heuristic is used to calibrate the constant in front of the penalty. Suppose that
some quantity $\Delta_V$ proportional to the complexity is known (in the simulations, we will use
$\Delta_V = a^v/n$, even though we only know from Theorems 3.1 and 3.2 that it provides an
upper bound on this complexity). We can apply the following algorithm.

1. For all $K > 0$, we choose the model $\hat V(K)$ selected by the penalty $\mathrm{pen}(V) = K\Delta_V$.

2. We find $K_{\min}$ such that $\Delta_{\hat V(K)}$ is very large for $K < K_{\min}$ and much smaller for $K > K_{\min}$.

3. We select $\hat V = \hat V(2K_{\min})$.

The idea is that $K_{\min}\Delta_V$ should be the minimal penalty $\mathrm{pen}_{\min}(V)$, because we observe a jump
of the complexity $\Delta_{\hat V(K)}$ around $K_{\min}\Delta_V$, as expected from SH1 and SH2. Therefore, $\hat V$, chosen by
$2K_{\min}\Delta_V = 2\mathrm{pen}_{\min}(V)$, should be optimal by SH3.
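The three steps above can be sketched in Python. The sketch assumes that each candidate model is summarized by a pair (unpenalized empirical criterion, complexity $\Delta_V$), that $K$ ranges over a finite increasing grid, and that $K_{\min}$ is located at the largest drop in the complexity of the selected model between two consecutive grid values. This "largest jump" rule is one common way to detect the jump and is an assumption of the example, not a prescription of the paper; the function name is hypothetical.

```python
def slope_heuristic(models, constants):
    """Calibrate the penalty constant by the slope heuristic.

    models    : list of (crit, delta) pairs, crit being the unpenalized
                empirical criterion and delta the complexity Delta_V
    constants : increasing grid of candidate constants K
    Returns (K_min, index of the model selected with penalty 2 * K_min * delta).
    """
    def select(K):
        # Step 1: model minimizing the criterion penalized by K * Delta_V.
        return min(range(len(models)), key=lambda m: models[m][0] + K * models[m][1])

    complexities = [models[select(K)][1] for K in constants]
    # Step 2: locate the largest drop of the selected complexity along the grid.
    jumps = [complexities[k] - complexities[k + 1] for k in range(len(constants) - 1)]
    k_star = max(range(len(jumps)), key=jumps.__getitem__)
    K_min = constants[k_star + 1]
    # Step 3: select the model with twice the minimal constant.
    return K_min, select(2 * K_min)
```

The grid resolution limits the precision of $K_{\min}$, so in practice a fine grid around the observed jump is preferable.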

There now exist several proofs of this heuristic in various problems; see for example [AM09] or
[AB10] for the problem of regression on histograms, [Ler11b] and [Ler11a] in density estimation,
or [Ver09] for a partial justification of this heuristic in a Gaussian graphical model selection
problem. This section is devoted to the theorems justifying this heuristic in our problem.
4.1 The quadratic loss

Theorem 4.1. Let $(S,A,P)$ be a random field. Let $r > 0$, $\epsilon > 0$ and assume that

2 ' 



P VV~ € V s , < pen(V) < (1 - r) 



fi|V - fi|V 



> 1 - e. 



Let 



V = arg min 

vev s 



fi- 



ll v 



For all $\delta > 1$, with probability larger than $1 - \epsilon - 2\delta^{-1}$,



+ pen(V~) 



p p ~ 

i|v i|y 



_ > sup < r 



2 R,c- R 



i|S ~ -MlVjlp 



17 ln(JV s 2 «5) 



n 



Comments: 



• When $V$ is large, the deterministic term $\|P_{i|S} - P_{i|V}\|_P^2$ is very small compared to the
variance term $\|\hat P_{i|V} - P_{i|V}\|_P^2$. Theorem 4.1 is therefore a minimal penalty theorem. It
states that, if the penalty term is too small, the complexity of the selected model (measured
here with $\|\hat P_{i|\hat V} - P_{i|\hat V}\|_P^2$) is as large as possible. This is SH1 with $\Delta_V = \|\hat P_{i|V} - P_{i|V}\|_P^2$.

Let us now state the associated optimal penalty theorem which proves the slope heuristic. 

Theorem 4.2. Let $(S,A,P)$ be a random field. Let $\delta > 1$, $r_1 > 0$, $r_2 > 0$, $\epsilon > 0$ and assume
that



P(vvev s , (l + n) 



Pi\v - P%\v 



< pen(V) < (l + r 2 ) 



Pi\V ~ P%\V 



> 1 - e. 



Let 







V = arg min < — 


Pi\V 


vev s [ 



+ pen(V) } . 



For all V in V s , let p_ = in£ x eX(V), P(x(V))^0 P ( X (V)) an ^ assume that, for some e < 1, 



inf p_ > e 



v „ 2 ln(niV s <5) 



n 



Then, there exists an absolute constant $C$ such that, with probability larger than $1 - 5\delta^{-1} - \epsilon$,
for all $V$ in $\mathcal{V}_s$, for all $\eta > 0$,



(1-7?) A (ri-C(l + n)e) 
(1 + n) V (r 2 + C(l + r 2 )e) 



fj|S - f^iv 



< 



2 6 ln(./V B 2 a) 
p rj n 



(11) 



Comments: 

• Let us assume that $\epsilon \to 0$. First, take $r_1$, $r_2$ close to 0. The penalty is therefore slightly
larger than the minimal penalty



p p ~ 

«|y «|y 



p — p ^ 

t|v «|y 



< C ri r2 n I inf 
p \ V&Vs 



It comes from (11) that 



fi|5 " fi|l 



+ 



6 HN^S) 
r\ n 



The complexity of the selected model is therefore that of an oracle, which is much
smaller than the maximal one. We observe a jump in the complexity of the selected
model around $\mathrm{pen}_{\min}$; this is SH2.

• Take $r_1$, $r_2$ equal to 1. The penalty is then equal to $2\mathrm{pen}_{\min}(V)$. Inequality (11) states,
in that case, taking $\eta$ close to 0, that $\hat P_{i|\hat V}$ satisfies an asymptotically optimal oracle
inequality. This is SH3. We have therefore justified the slope heuristic for the $L_2$-risk. In
the following section, we give the theorems justifying it for the Kullback loss.

4.2 Slope heuristic for the Kullback Loss

The purpose of this section is to give the equivalent of Theorems 4.1 and 4.2 in the case of
the Kullback loss.

Theorem 4.3. Let $s > 0$, $\delta > 1$, $\epsilon > 0$, $r > 0$, $p_* > 0$, $\Lambda > 100$ and let $\mathcal{V}_{s,\Lambda,p_*}$, $\mathcal{V}^{(2)}_{s,\Lambda,p_*}$ be the
collections defined in (3) and (4). For all $V$ in $\mathcal{V}_s$, let

*e*(v) \^\v(x) J 

Let $\hat V$ be the penalized estimator defined in (9) with a penalty term satisfying

$$P\big(\forall V \in \mathcal{V}_{s,\Lambda,p_*},\ 0 \le \mathrm{pen}(V) \le (1 - r)\,p_2(V)\big) \ge 1 - \epsilon.$$

Then, we have, with probability larger than $1 - 2\delta^{-1} - \epsilon$,

p 2 (V) > max (rp 2 (V) - 2K(P i]s , P i]v )} - ( A]nn+ A-) 



Let $\hat V^{(2)}$ be the penalized estimator defined in (10) with a penalty term satisfying

$$P\big(\forall V \in \mathcal{V}^{(2)}_{s,\Lambda,p_*},\ 0 \le \mathrm{pen}(V) \le (1 - r)\,p_2(V)\big) \ge 1 - \epsilon.$$

Also, we have, with probability larger than $1 - 2\delta^{-1} - \epsilon$,

p 2 {V (2) )> max {rp 2 (V)-2K(P lls ,P llv )}- ln ^(4lnn+ ±) 

sA,p* 



Comments: 



• Theorem 4.3 states that, when the penalty term is smaller than $p_2(V)$, the complexity
$p_2(\hat V)$ is as large as possible. This is exactly SH1, with $\mathrm{pen}_{\min}(V) = \Delta_V = p_2(V)$.

Theorem 4.4. Let $s > 0$, $\delta > 1$, $\epsilon > 0$, $r_1 > 0$, $r_2 > 0$, $p_* > 0$, $\Lambda > 100$ and let $\mathcal{V}_{s,\Lambda,p_*}$, $\mathcal{V}^{(2)}_{s,\Lambda,p_*}$
be the collections defined in (3) and (4). For all $V$ in $\mathcal{V}_s$, let $p_2(V)$ be the quantity defined in
Theorem 4.3. Let $\hat V$ be the penalized estimator defined in (9) with a penalty term satisfying

$$P\big(\forall V \in \mathcal{V}_{s,\Lambda,p_*},\ (1 + r_1)\,p_2(V) \le \mathrm{pen}(V) \le (1 + r_2)\,p_2(V)\big) \ge 1 - \epsilon.$$

Then, there exists an absolute constant $C$ such that, for all $\eta > 0$, with probability larger than
$1 - 2\delta^{-1} - \epsilon$,

C L K(P lls ,P ilV ) < {K(P lls ,P llv )} + ( 2 ln(n) + (12) 
where

$$C_L = \frac{(1 - \eta) \wedge \big(r_1 - C(1 + r_1)\Lambda^{-1/2}\big)}{(1 + \eta) \vee \big(r_2 + C(1 + r_2)\Lambda^{-1/2}\big)}.$$

Let $\hat V^{(2)}$ be the penalized estimator defined in (10) with a penalty term satisfying

$$P\big(\forall V \in \mathcal{V}^{(2)}_{s,\Lambda,p_*},\ (1 + r_1)\,p_2(V) \le \mathrm{pen}(V) \le (1 + r_2)\,p_2(V)\big) \ge 1 - \epsilon.$$

Also, there exists an absolute constant $C$ such that, for all $\eta > 0$, with probability larger than
$1 - 2\delta^{-1} - \epsilon$,

C L K(P lls ,R l9i2) )< inf {K(P lls ,P tlv )} + ^Hn) + ^^Y^^, (13) 



Comments: 



• Let us take $\Lambda = 100 \vee \ln(n)$. Take at first $r_1$ and $r_2$ slightly larger than 0, and therefore
a penalty slightly larger than $\mathrm{pen}_{\min}$. Then (12) implies that, when $n$ is sufficiently large,
$C_L > 0$, hence

p 2 (V) < K(P lls ,P^) < ( y£ inf ^ K(P ils ,P ilv ) + (21n(n) + 

<< SUP ^(Pjly^jly). 

vev s , A , p , 
This justifies SH2. 

• Take now $r_1$ and $r_2$ equal to 1, so that the penalty is equal to $2\mathrm{pen}_{\min}$. Then, we can take
$C_L \to 1$ in (12). This justifies SH3.



5 Simulations 

In this section we illustrate the results obtained in the previous sections using simulation experiments.
All these simulation experiments can be reproduced with a set of MATLAB® routines that can
be downloaded from www.princeton.edu/~dtakahas/publications/LT11routines.zip.
Let $S = \{-1, 0, 1\} \times \{-1, 0, 1\}$ and $A = \{-1, 1\}$. For all the simulations we consider an Ising
model on $A^S$, with one-point conditional probability for all $x \in A^S$ given by

$$P_{i|S}(x) = \frac{1}{1 + \exp\big(-2 \sum_{j \in S} J_{ij}\, x(i)\, x(j)\big)},$$

where the pairwise potential $(J_{ij})_{i,j \in S}$ is given by $J_{ij} = J\,\mathbf{1}_{j \in V_i}$ for $J = 0.2$ and $V_i \subset S$. The
pairs of sites $(i,j)$ where $j \in V_i$ are shown in Figure 1. For all these experiments, $i = (0,0)$.
We simulated independent samples of the Ising model with increasing sample sizes $n = 100k$,
$k = 1, \dots, 100$. For each sample size we have $N = 100$ independent replicas.
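A setup of this kind can be reproduced in a few lines of Python, independently of the MATLAB routines mentioned above. Since $S$ has only 9 sites, the $2^9 = 512$ configurations can be enumerated and sampled exactly. The interaction graph of Figure 1 is not recoverable from the text, so the sketch below assumes, purely for illustration, nearest-neighbor edges on the $3 \times 3$ grid; the names `EDGES`, `ising_weights`, `sample_ising` and `conditional_at` are hypothetical.

```python
import math
import random
from itertools import product

SITES = [(r, c) for r in (-1, 0, 1) for c in (-1, 0, 1)]
J = 0.2
# Assumed interaction graph: nearest neighbors on the grid (not taken from Figure 1).
EDGES = [(u, v) for u in SITES for v in SITES
         if u < v and abs(u[0] - v[0]) + abs(u[1] - v[1]) == 1]


def ising_weights():
    """Enumerate all configurations with their unnormalized Gibbs weights."""
    index = {s: k for k, s in enumerate(SITES)}
    configs = list(product((-1, 1), repeat=len(SITES)))
    weights = [math.exp(J * sum(x[index[u]] * x[index[v]] for u, v in EDGES))
               for x in configs]
    return configs, weights, index


def sample_ising(n, seed=0):
    """Draw n i.i.d. configurations exactly from the Ising model."""
    configs, weights, _ = ising_weights()
    return random.Random(seed).choices(configs, weights=weights, k=n)


def conditional_at(i, x, index):
    """One-point specification P_{i|S}(x) = 1 / (1 + exp(-2 sum_j J_ij x(i) x(j)))."""
    field = sum(J * x[index[v if u == i else u]] for u, v in EDGES if i in (u, v))
    return 1.0 / (1.0 + math.exp(-2.0 * x[index[i]] * field))
```

Exact enumeration avoids any burn-in question that a Gibbs sampler would raise, at the cost of being limited to very small site sets.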



5.1 Variance term of the risk 



The following experiment illustrates Theorem 3.1 and Theorem 3.3. For each sample size
we computed the normalized variance terms, namely $n\|\hat P_{i|V_i} - P_{i|V_i}\|_P^2$ for the $L_2$-norm and
$nK(P_{i|V_i}, \hat P_{i|V_i})$ for the Kullback loss. The average values are shown in Figure 2 and show
that the variance term scales as $1/n$. As the behavior of the $L_2$-norm and the Kullback loss is quite
similar, in what follows we show simulation results only for the $L_2$-norm.
Figure 1: Representation of the interacting pairs of the Ising model used in the simulation experiments.
The edges between sites indicate the interacting pairs. The grey colored edges indicate the sites
interacting with site (0, 0).

Figure 2: Plot of the number of samples $n$ against $n\|\hat P_{i|V_i} - P_{i|V_i}\|_P^2$ for the $L_2$ norm and
$nK(P_{i|V_i}, \hat P_{i|V_i})$ for the Kullback loss (panels: L2 norm, Kullback loss; horizontal axis: sample size).
The dotted lines indicate the linear regression lines. Observe that the regression
line is essentially parallel to the abscissa.



5.2 Slope heuristic 

Here we illustrate the slope heuristic. We use 500 samples from the Ising model described at
the beginning of this section. We use as the measure of complexity, for $i = (0, 0)$ and $V \subset S$,
the quantity $\|\hat P_{i|V} - P_{i|V}\|_P^2$. In Figure 3 we plot the complexity of the model selected by
the criterion

$$\min_{V} \left\{ -\|\hat P_{i|V}\|_{\hat P}^2 + c\,\|\hat P_{i|V} - P_{i|V}\|_P^2 \right\}$$

against the positive constants $c < 8$. We clearly see that when $c$ is smaller than 1 the complexity
is the largest possible, which is the content of Theorem 4.1. We also observe that when $c$
is slightly larger than 1 there is a sudden decrease in the complexity, which is the content of
Theorem 4.2. Finally, the model chosen by $c = 2$ is exactly the one given by the oracle, as predicted
by Theorem 4.2.
Figure 3: Example of slope heuristic. Observe the sudden change in behavior around the minimal
penalty. (Horizontal axis: constants; the marked values are the minimal penalty and 2 × the minimal penalty.)



5.3 Oracle risk compared to the risk of the estimated model 



Observe that, in the simulation above, we used the quantity $c\,\|\hat P_{i|V} - P_{i|V}\|_P^2$ as the penalty
term. In practice we cannot compute this quantity and we use instead the quantity $c\,a^{v-1}/n$
given in Theorem 3.2. Here we illustrate the performance of the slope heuristic using this
last quantity. Let $\hat V(2C_{\min})$ be the neighborhood selected by the slope heuristic. One way to
verify the performance of the slope heuristic is to compute the risk ratio

$$\frac{\|\hat P_{i|\hat V(2C_{\min})} - P_{i|S}\|_P^2}{\inf_{V \subset S} \|\hat P_{i|V} - P_{i|S}\|_P^2}. \tag{14}$$

For each sample size, we computed the ratio (14) for 100 different samples and took
the average. The result is summarized in Figure 4. For comparison, we also estimated the
average risk ratio for the model selected using the theoretical constant $6Ka^{-1}$ with $K = 2$ given
by Theorem 3.2. Observe that, as the sample size $n$ increases, the risk ratio of the model
selected by the slope heuristic approaches one, as we expect from Theorem 4.2. We also
observe that the slope heuristic in general achieves a better risk than the criterion using the
theoretical constant.



6 Discussion 

The problem of recovering the support $V_\star \subset S$ of $P_{i|S}$ is an active area of research (see [BMS08,
CT06a, GOT10, RWL10]). The main drawback of these works is the restrictions imposed to
guarantee the results. In particular, it is always assumed that all the sites of interest are
observed, i.e., $V_M = S$. This is never the case in important applications like neuroscience and
molecular biology. In neuroscience, for example, the experimenter has access only to a tiny
fraction of the whole neural network and has to make inferences based on it. Clearly the exact
recovery of $V_\star$ is out of the question; rather, a good approximation of the local rules $P_{i|S}$ is
desired. The model selection approach is a natural way to formulate this problem.

We may wonder whether the conditions in [BMS08, CT06a, GOT10, RWL10] are satisfied if the
measure $P$ of interest is not the one on $A^S$ but its projection on $A^{V_M}$. Unfortunately, this is
not the case either, because [BMS08, CT06a, GOT10, RWL10] assumed that $P$ is Gibbsian and,
in general, a projection of a Gibbs measure is not Gibbsian [FP97].
Figure 4: Plot of the sample size $n$ against the average risk ratio for the models selected
by the slope heuristic (solid line) and by the theoretical constant (dashed line).



We think that these considerations alone are important enough to justify our work on model
selection procedures for general random fields, but a more detailed comparison reinforces our
points.

[CT06a] considered the problem of recovering the support of $P_{i|S}$ in a homogeneous, finite
range random field based on one realization. Homogeneity is not realistic in our applications
and the comparison with our work is not straightforward; nevertheless we observe some
interesting aspects.

1. The consistency result in [CT06a] is asymptotic whereas all our results are non-asymptotic
and hold for all $n$.

2. They considered finite range interaction random fields eventually included in the observed
sites. Our approach lets us work with non-observed sites and infinite range random
fields.

3. The number of observed sites $|\Lambda|$ in [CT06a] is the analogue of the number of
samples $n$ in this paper. Theorem 2.1 in their article shows that they select a neighborhood
of order $o(\log^{1/2} n)$ among the $o(\log^{1/2} n)$ closest sites. Our model selection algorithm can
be applied in high-dimensional situations and allows a maximum neighborhood size of $O(\log n)$
selected from $O(e^{n^\beta})$, $0 < \beta < 1$, possible sites.

4. They considered penalized log-likelihood estimators like those that we studied in the Kullback
case. Our results on the Kullback loss can therefore be seen as natural extensions of those in
[CT06a] to the model selection setup. Our penalty, designed for the oracle approach, is of
AIC type, $K a^v/n$, whereas they considered, for exact recovery, a BIC-type penalty of order
$K a^v \ln n/n$. This is a difference between the oracle approach and model identification that
was already noticed in a regression framework; see for example [Yan05].
[GOT10] considered the problem of recovering the support of $P_{i|S}$ for infinite range Ising
models in $\mathbb{Z}^d$. The main restrictions in this work are that the interactions between the sites
are supposed to be pairwise and weak ("high temperature"), and that a subset of the observed sites
of size $O(\log(n))$, where $n$ is the sample size, must be fixed to apply the proposed procedure.
Our procedure has no restriction on the strength of interaction, can be applied to non-pairwise
interactions, and does not require fixing a subset of observed sites.

In [BMS08], the analysis is restricted to finite random fields, where the maximum neighborhood
size is known a priori. For infinite range random fields, these results are useless since the
"constants" $\epsilon$ and $\delta$, which should be positive, are both equal to 0 in general. More importantly,
the procedure uses the knowledge of the lower bound $\epsilon$ on the bias term. As this $\epsilon$ is unknown
in practice, it is not clear, even if the underlying model is a finite random field, how it should be
evaluated. It is not clear how to generalize these results to the case where the maximum neighborhood
size is allowed to grow with $n$. This would require a careful analysis of the behavior of
the quantities $\epsilon$ and $\delta$, which are hard to compute even in simple models. Nevertheless, in the
specific case when the underlying random field is the Ising model, a straightforward computation
based on Theorem 3 in [BMS08] shows that, when the total number of sites is $O(e^{n^\beta})$, $0 < \beta < 1$, the
maximum size of the allowed neighborhood is $O(\log n)$.

In [LT11], we introduced a model selection procedure for $L_\infty$-risk. We worked with random
fields with binary alphabet and under some restrictions on the probability measure P. We
showed the superiority of the oracle approach compared to the identification procedures available
in the literature. However, we were not able to prove the slope heuristic. In the present work,
we obtain sharper oracle inequalities, we prove the slope heuristic and we remove all the
restrictions of [LT11].

The proof of the slope heuristic for general model selection problems is still in its early stages
and our results of Section 4 are major contributions to this problem. In particular, we provide,
to our knowledge, the first proofs of this heuristic in a discrete framework. Moreover,
our proof in the Kullback case is the only one, with [Sau11], that holds for a non-Hilbertian
risk. Finally, following the notations of [AM09], the proofs usually rely on good concentration
properties of the terms $p_1$ and $p_2$ and a comparison of their expectations. We proceed here with
a direct comparison of these terms, proving some typicality results for the terms $p_1$ and $p_2$. See
Theorem 8.7 and Lemma 8.22. Our approach can be understood as a pathwise version of the
strategy suggested in [AM09].

The work [RWL10] is restricted to the Ising model on finite graphs and assumes the incoherence
condition, which is very restrictive (see [BM09]). Nevertheless, the use of $\ell_1$-penalization
allows a computationally efficient implementation of the algorithm proposed in [RWL10]. This
is critical in applications. For the moment, our algorithm lacks this computational efficiency.
Obtaining a fast implementation of our algorithm, or of an approximation of it, will be our main
task in future work.

We provide in Table 1 a comparative summary of the available results.

7 Proofs

7.1 Proof of Theorem 3.1

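Let $\theta > 0$ to be chosen later and let $Q$ denote either $P$ or $\widehat{P}$. We decompose the risk as follows

$$\left\|\widehat{P}_{i|V} - P_{i|V}\right\|_Q^2 = \frac{1}{a}\sum_{x\in X(V)} Q(x(V/\{i\}))\left(\widehat{P}_{i|V}(x) - P_{i|V}(x)\right)^2$$
$$= \frac{1}{a}\sum_{x\in X(V),\ Q(x(V/\{i\}))\le\theta(a^{v}n)^{-1}} Q(x(V/\{i\}))\left(\widehat{P}_{i|V}(x) - P_{i|V}(x)\right)^2 + \frac{1}{a}\sum_{x\in X(V),\ Q(x(V/\{i\}))>\theta(a^{v}n)^{-1}} Q(x(V/\{i\}))\left(\widehat{P}_{i|V}(x) - P_{i|V}(x)\right)^2.$$

As the cardinality of $X(V)$ is $a^{v}$ and $\left(\widehat{P}_{i|V}(x) - P_{i|V}(x)\right)^2 \le 1$, the first term in this decomposition
is upper bounded by $\theta n^{-1}$. Hence

$$\left\|\widehat{P}_{i|V} - P_{i|V}\right\|_Q^2 \le \frac{\theta}{n} + \frac{1}{a}\sum_{x\in X(V),\ Q(x(V/\{i\}))>\theta(a^{v}n)^{-1}} Q(x(V/\{i\}))\left(\widehat{P}_{i|V}(x) - P_{i|V}(x)\right)^2. \qquad (15)$$

Hereafter in the proof of Theorem 3.1 we denote by

$$X_\theta(V) = \left\{x\in X(V),\ Q(x(V/\{i\})) > \theta(a^{v}n)^{-1}\right\}.$$

It comes from Lemma 8.1 that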



P%\V -Pi\v\\p 



£ E 

x&X e (V) 



E 



n L — ' a 

xex 9 (v) 



P(x(V)) - P(x(V)) +P l \ v (x) [P(x(V/{i}))-P(x(V/{i})) 



aP(x(V/{i})) 



< 



y P(x(V)) - P(x{V)) j 
\ P(x(V/{i})) + ^ 

\ xeX s (V) V V x£X e (V/{i}) 



From Lemma 8.1, we also have 

P{x{V)) - P{x{V)) 
\Pi\v{x) - Pi\v(x)\ < 

Hence 



+ Pi\v(x) 



P(x(V/{i})) - P(x(V/{i})) 

P(x(y/{i})) 



P(x(V/{i}))-P(x(V/{i})) 



\Pi\v{x)-P A v{x)\ < 
Thus, 



aP{x(V/{i})) 

P(x(V))-P(x(V)) +(P i[v (x) + P ilv (x)) (P(x(V/{i}))-P(x(V/{i})) 



Pi\v -Pi\v\\p 



aJP(x{V/{i}))P{x(V/{i}) 



^ P(x(V/{i})) /g 



n ' — ' a 

x£X e (V) 



is smaller than 



E 

x&X e (V) 



( P(z(V)) - 


+ (P ilv (x) + P llv (x)) 


(p(x(v/m-p(x(v/m) 


r 


aP(x(V/{i}) 


) 



< 



E 

\ xex e (v) 



P(x(V)) - P(x(V)) 
P(x(V/{i})) 



+2 E 

xex 9 (v/{i}) 



P(x(V/{i}))-P(x(V/{i})) 

pJxW/W) 



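We use Theorem 8.14 with $b = \sqrt{\theta^{-1}a^{v}n}$: for all $x > 0$ and all $\eta > 0$, we have, with probability
larger than $1 - 2e^{-x}$,

$$\left\|\widehat{P}_{i|V} - P_{i|V}\right\|_Q^2 \le \frac{\theta}{n} + \frac{6}{a}\left((1+\eta)^3\frac{a^{v}}{n} + \frac{4x}{\eta n} + \frac{32a^{v}x^2}{\theta\eta^3 n}\right).$$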

Take = 8a v / 2 xr] 3 / 2 , we obtain 



f*i|y _ Pi\v 



2 6 A s ,a v Ax 6a v / 2 x 
< - l + r, 3 - + — + — — 
Q a \ re r/n rfl^n 



Using $ab \le \eta a^2 + (4\eta)^{-1}b^2$, we finally get



Pi\V ~ Pi\V 



2 6 /. .a" 4x 9x 2 
< - l + 8r? — + — + ^r- 
Q a \ n r/n rfn 
7.2 Proof of Theorem 3.2



For all probability measures $Q$, let $\langle\cdot,\cdot\rangle_Q$ be the scalar product associated to the $L_{2,Q}$-norm
$\|\cdot\|_Q$. Let $V$ and $V'$ be in the collection $\mathcal{V}_s$. We have



- P«VUV'))P llv (x) 



x&X(VUV) 



xex(v) 



P(x(V/{i})) 



P i \ v {x)P i \ v {x) = (Pi\ V ,Pi\v) p - 



1 

a 



x£X(VUV') 

Hence, for all V, V in V s , 



P(x(VUV'))P ilv (x)= ]T 

xeX(V) 



P{x{V/{i])) p 2 M _|| P| ||2 
P%\V\ X ) - \\Pi\v\\p 



P 



i\V 



Pi\V ||p + ^ ( -Pill/ — PiW, Pi 



i\V — i|Vi r i\V 



+ 



Pi\v - Pi\V 



\Pi\V\\p + \\Pi\V — P%\V 



Pi\vfp-\\Pi\v\ 



(16) 



Moreover, from the Pythagoras relation (see Proposition 8.17), we have

$$\left\|P_{i|S} - P_{i|V}\right\|_P^2 = \left\|P_{i|S}\right\|_P^2 - \left\|P_{i|V}\right\|_P^2.$$

By definition of $\widehat{V}$, we have, for all $V$ in $\mathcal{V}_s$,



P 



i\S\\ P 



P. 



i\V 



+ pen(y) < \\P AS \ 



P 



i\V 



+ pen(y) 



Hence, for all < v < 1, from (16), 



P%\S - P i\v 



< 



Pi\S - P t \y 



+ V 



p p ~ 

i\V %\v 



is smaller than 



I 1 1 2i 

\ P i\S ~ Pi\v\\p + pen(F) 



Pi\v - Pi\v 



pen(V) 



P i\V P i\V 



i\V r i\V 



+ ( \\P' i\v\\p ~ \\Pi\V\ 



P 



i\V 



+ 



p 



i\V 



+ - Yl (P( x (VUV))-P(x(VUV)))(P ]9 ( 
xeX(VuV) 



Pi\v(x] 



(17) 



We have also, 



I^VlIp- II^VlIp - \\Pi\v 



+ 



i\V 



\ J2 ( p ( x « v u ^/W)) - PM<V u v)/{i}))) (pIvW - rfai*)) ■ 



x&X((VUV)) 
Let $0 < \eta < 1$, $\delta > 1$ and assume that $N_s \ge 2$. Let $\Omega_\delta$ be the intersection of the following
events:

ni = {W e v., |Pi| V - Pjll < 6 (d + + «MW>!) \ . 
[ 1 1 p a \ n jfn ) J 

IS - {w e v„ |p ( |v - Pilv t < ? (<i + 8 ,)£ + Ml. 

[ 1 1 P a \ n v^n ) J 

n s 3 = [w,V g vl ||p i|v ||| - ||p i|v ||J, - + ||iV'||p 



^ = i w, v e vi Yl ^ x{ y uy ')) - p ^ v u v)) Pllv ' {x) Pllv{x) 



xex(vuv) 



- \\ P i\V ~ P i\V'\\p \l 2 



ln(A^) ln(iV s 2 <5) 



71 



3ra 



(19) 



Theorem 3.1, Lemma 8.16 and union bounds give that 



p n 6 ) < 



For all V, V in V s and all £ > 0, on Q s , we have 

2 x; (p(*(vuv>)) - p( X (v u o) Pt|y,(a;) ~ Pilv{x) + IIP^III 



a;e*(VW) 
- IIP,- " 2 



1 2 II 1 1 2 £ 

Piv f? + Piv r, < ^ Pi 



~~ Ir i|v Hp lr i|v||p ^ 2 II *[v ~~ ^l^'llp ~i~ \ ~g 
From (17), we deduce that, on O 5 , for all < £ < 77, 

2 



3/7 



P|5 - P|y 



< (l + Z)\\p lls -P tlv \\ 2 p + pe n(V) 

- fpen(^)-(l + i/)(l + r ? ) 3 -- N ) 
\ an) 



77 V 77* a 



| + lj ln(A^)). 



Take at first < £ < ^ and < n sufficiently small to ensure that (1 + + t?) 3 < if to obtain 
0. To obtain ([8]), choose v = 1 and 7/ > sufficiently small to ensure that (1 + t/) 3 < if/2 
and £ = (La(iVg<y)) . We conclude the proof, saying that the inequality is obvious when 5 < 4, 
and, when S > 4, 



1 + (In A^)- 1 _ i + 2(lniV^)- 1 1 + 2(ln < *)- 1 : + 



1 - (lnA^)" 



1 - (IniVfJ)- 1 - 1 - (ln^)- 1 



ln<5' 
7.3 Proof of Theorem 3.3

(2) 

Let V in V Sj a or ^ and let us define 

P1 (V)= ^ P(x{V))J^-Y P2 (V)= Y, P(^))lnf^MV 
From Lemma 18.211 and we have 



1() ^ (p(x(v)) - P(x(v))) 2 14 (p(x(y/W))-p(x(y/{,}))) 2 



^V)< Y E ^TtT^ ~ + T E 



3 ^ P(x(V)) 9 ^ P(x(V/{i})) 



E — L + ^ E v 



2 ^ x P(^)) 3 f-' , P(s(V7{«})) 

xeX(V) y K " x&X(V/{i}) y x /l > " 



Let V* = V or On the event £l pro b{5) defined in Lemma 8.20, thanks to Lemma 8.20 

have 



we 



1 1 1 + 2A" 1 / 2 iJn 

sup = < sup — = < sup — = < . 

xexty.) i/P(a;(K)) xs#(v.) \/P(:e(V;)) ze*(V*) Jp(x(y^) VMn(2a s iV s 5) 

As this quantity is not random, the same bound holds on £l pro b(5) c . We can apply Theorem 



8.14 to get that, for all x > 0, for all r] > 0, with probability larger than 1 — 2e x , 



44/. .,o' 4x 128x 2 
Pl(^) < - (l + r ? ) 3 - + — + 



n ryn nr/ 3 A ln(2a s iV s (5) / 



, . 23 /, ,oa v Ax 128x 2 
p 2 (^) < — (l + v) — + — + 



6 \ n r/n nr] 3 Aln(2a s N s 5) 

;e a union bound to obi 

1-5, 



(2) 

We use a union bound to obtain that, for all V in V Sj a or V S 7 (, with probability larger than 



, , 44/, .oo" /4 128\ln(2A^) 



n \rj r/ 3 A J n 
7.4 Proof of Theorem 3.4

Let us first decompose the selection criterion as follows. 
- ]T P(x(V)) In (P llv (x) ) + pen(F) = K(P i]s , P i[v ) + pen(F) - Pl (V) - p 2 (V) + L(V) 



xex(v) 



+ /™'"(^)' (20) 
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In the previous decomposition, we have 



P1 (V)= Yl P(^))ln(5^ 

\ Pi\v{x) 



x£X(V) 



P2(V)= £ P(x(Y))}n(^- 



L(V)= Y ^ x ^- P ^ V ^ }n (l^))- 



We deduce from (20) and the definition of V that, for all V in V Si a,p„j 

K(P i{s ,P. l9 ) < K(P lls ,P tlv )+pen(V)- P2 (V) - (pen(F) - Pl (V) - p 2 (V) ) + L{V) - L(V). 

(21) 

Let tlprobid) defined in Lemma 8.20 Let rj > and let £l p i, p2 (5) be the event, for all V in V S) A,p» 

Let be the event, for all V, V in V S) A,p*; 

(L(V) - L(V'))l nprob{5) < V (K(P lls ,P tlv ) + K(P rls ,P ilv ,)) + (41nn+ JL) . 



Let Q = O pro (,(<5)nfipi i p2(^)nfii(5). It comes from Lemma 8.20, Theorem 3.3 and Lemma 8.23 



that P(n c ) < 35. Moreover, on Q, we have, for all V in V S) A,p* > 

Pi(t>) + P2 (V) + L(F) - L{V) < pen(V) + ^(P,, s , P^ v ) + if(P j|5 , P^)) 



ln(iV?<5) / , 3 64 \ 
H 41nn+ h 1 + ^— I 



2t/p» (k-i) 2 /3a; 



Hence, on O, 



l^A'(P i|s .P„,) < K(P (|s ,P (|v) + pen0O + ^ (41n„ + ^ + 1 + j^^) 



^ (2) 

We deduce from (|20|) and the definition of V( 2 ) that, for all V in , 



K(P tls ,P tl?(2) ) < K(P ils ,P i]v )+pen(V)-p 2 (V)-( V en(V {2) ) - Px{V {2) ) -p 2 (V" (2) ))+^(^)-^(F(2)). 

, , (22) 

Let f2prob(^) defined in Lemma 8.20 Let 77 > and let Q pl p2 (S) be the event, for all V in p ^ 

*<v>s.(a + ^ + (i + £)^). 



n 



^) £ 4| (1 + ,)- ni + Ji)lM). 
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Let n { ^(5) be the event, for all V, V in V s (2 ] pt , 



(L(V) - L(V'))l nprob{s) < n{K{P i{ s, Pi\ v ) + K(P ils , P AV ,)) 



n 



4 Inn 



2r/p* 



(2) (2) 

Let n = n^ ob (5)nn y pl [ p2 (5)nri^'(s). it comes from Lemma 8.201 Theorem|3.3 
that P(n^) > 1 - 35. Moreover, on Q( 2 \ we have, for all V in vf] , 



and Lemma 8.23 



Vl{V {2) ) +P2(V {2) ) + L{V) - L(V {2) ) < pen(V (2) ) + v(K(P i{S , Pi\v) + K(P i[s , P,^)) 



+ 



n 



64 



41nn+ 2^ + 1+ (K-T) 2 /3Aj 



Hence, on 

i^K(P i|s ,P. |%) ) < jr(P 4 , 5l P, |v )+pen(V) + 

7.5 Proof of Theorem 4.1

Let us introduce, for all V in V s , 

|2 



ln(iV?<5) 



n 



4 Inn + 



2?^ 



+ 1 + 



64 



(^-1) 2 / 3 a) ' 



W = ||P < | V ||;-||P i|v ||i + - ^ (P(x(V))-P(x(V)))P llv (x). 



xex(v) 



By definition of V, we have, for all V in V s , 



P 



»|S||p 



p 



i|V 



+ pen(y) < llP^ | 



P; 



+ pen(F). 



Hence from inequality (16) in the proof of Theorem 3.2 we have, for all V in V s , 



Pi\S ~ P i{ v 



+ pen(F) 



P P - 



< ||P i|g -P i|y || p+ pen(F) 



L(V) 



P%\V - Pi\V 



L(V). 



(23) 



Let fi 



pen 



< pen(y) < (1 - r) 



Pi\v - Pi\V 



and let 0*^ = fi|nO|nOp e „, where ft* 



and are respectively defined in (18) and (19). It comes from Lemma 8.16 and our assumption 
on pen(V) that P((^nii npen ) c ) — 6 + 2<5 _1 . Moreover, on J^mpem we have, for all r] > 0, 



\L(V)-L(V)\< V 



Pqs ~ p i\v 



+ ri\\Pi\s-Pi\v\\ P + — + 1 



16 



3n 



(1-7?) 



P 



P 



P, 



< (1 + 77) ||P i|s -P^ ||; 
We conclude the proof choosing 77 = 1. 



P 



i[V ~ M|V 



+ 



16 



+ 1 



3n 
7.6 Proof of Theorem 4.2

Let 



n 



pen 



W G V„(l + n) 



< pen(V) < (l+r 2 ) 



~~ P i\V 



let ^comc = ^I^^i^^pen, where SI3 and ^4 are respectively defined in (18) and (19). It comes 
and our assumption on pen(V) that -P((^minpen)°) — e + 2<5 _1 . Moreover, on 



5.16 



from Lemma 
^minpen' we nave > from (23), for all 1] > 0, 



0--v) 



< (l + ?? )||p. |5 -p. |y ||^ + r 2 



P i\S ~ P i\V 


2 

+ n 

p 


P i\V P i\V 


l + (l + n)( 


P i\V P i\V 


2 
P 


P \V P i\V 



(l + r a ) 



-Pi|V - P%\v 



!7 -, 
+ I — + 1 
17 



3n 



Let C be the constant given by Lemma 8.8 and let 



VV G V., 





2 










P 


< Ce 


P i\V - P i\V 


:} 



It comes from Lemma 8.8 that P(Q*) > 1 — <5 1 . Moreover, on f2 C omp nfi*, we have, from (23), 
for all < 77 < 1, 



(I-77) 



P i\S - P \v 



+ (n - C(i + n)e) 



p — p ~ 



< 



< (1 + 77) - Pi| V ||p + (r 2 + C(l + r 2 )e) 



P i\V - P i\V 



2 | 6 k(jgg) D 
P rj n 



7.7 Proof of Theorem 4.3

Let fi peri and fipen be the events, for all V in V Si a^, < pen(V') < (1 — r)p2(V) and for all V 



(2) 

in V*/ , < pen(V) < (1 - r)p 2 (V). It comes from (12 1 h that, on rip en , for all V in V S) A,p* 



p \v) -P2(V) < K(P i{S ,P ilv )-rp 2 (V) + L(V)-L(V). 



(2) (2) 

It comes from (|21|) that, on O pen , for all V in A , 



|V (2 )' 



■ MV(2)) < ^(^|5, ^|v) - r P2 (V) + L(V) - L{V {2) ). 



Let £l P rob(fi) be the event defined on Lemma 8.20 and 0,l(5) be the event, for all V, V in 
V s ,A,p„ , for all r/ > 

(L(V) - L{V')) < v(K&\S,Pi\v) + K(Pns,Piiv>)) + f 41nn+ " 



n 



27/p* 



From Lemmas 8.20 and 8.23, we have P(Q pro b(5)r)QL(5)) > 1—25 and, on £l pro b(5)r)£lL(5)r)£l pen , 
we have, for 77 = 1, 

-P2(V) < 2K(P tls ,P tlv )-rp 2 (V) + (4 In n + . 
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Let n { ^(5) be the event, for all V, V in V s (2 ] pt , for all r, > 



(L(V) - L(V')) < v(K(P lls , Pi\v) + K(P ils , Pi\v)) + 



n 



4 Inn 



2np* 



8.20 



and 



From Lemmas 
Qpen, we have, for r\ = 1, 



8.23 



we have P(n prob (5) nft[ 2) (5)) > 1-25 and, on fi profe (5)n^ J (5)n 



(2), 



-pa(V( 2 )) < 2^,5, P i]v ) - rp 2 (F) 



ln(iV s 2 5) 



71 



4 Inn + 



2p* 



7.8 Proof of Theorem 4.4

Let Slpen be the event, for all V in V s ,A,p*) (1 + rOi^OO < pen(V) < (1 + r 2 )p 2 (F). It comes 
from (21) that, on 0, pen , for all V in V Sj a,p*> 

K(P i]s , P 9 ) + ripi(V) + (1 + n)(p 2 (V) - Pl (V)) 



< K(P i]s , P i[v ) + r 2Pl (V) + (1 + r 2 )( P2 (V) - Pl (V)) + L(V) - L{V) 



Let fip ro &(5) be the event defined on Lemma 8.20 and Ol(5) be the event, for all V, V in V s ,A,p*) 
for all ry > 0, 



(L(F) - L(V)) < "(^(P|S, Pi\v) + ^(P|S, P^)) + 
From Lemmas 



ln(A^) 



n 



4 Inn 



27/73* 



8.20 



and 



8.23, we have P(Qprob(5)r\£lL,(5)) > 1—25 and, on J7 pro fe((5)nr2i(5)nf]p en , 
we have, from Lemma 8.22, for all V in V S) A,p*, 

C 



|pi(^)-p 2 (^)| < -7fPi(V). 

vA 



We obtain that, for all V in V. 



s,A,p* j 



(i - n)K(p j|s , p a9 ) + [n- C(1 ^ ri) ) ( r ) 



A 



< (1 + ^(P^, P, y ) + (r 2 + C( ^t r2) ) pi(y) + 



' 41nn + 



77 



27/p* 



Let 0^ be the event, for all V in V S (2 A , (1 + ri) P2 (V) < pen(V) < (1 + r 2 ) P2 (V). It comes 



from (|21|) that, on Op 2) n , for all V in V (2) 



s,A,P* ' 



^(P^.p^ ) + npi(y (2) ) + (i + n)( P2 (v i2) )- Pl (v i2) )) 



< K(P i[s , P i[v ) + r 2Pl {V) + (1 + r 2 )(p2(V) - + L(V) - L(V {2) ) 



Let £l P rob($) be the event defined on Lemma 8.20 and JlV (6) be the event, for all V 5 V 7 in 



vg*., for all /?>(), 

- < "(^(P|5, Piv) + ^(P|5, P|v)) + 



n 



4 Inn 
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From Lemmas 


8.20 


and 


8.23 


we ] 


^penj we have, 


from Lemma 


8.22 



)(2), 



>( 2 ), 



for all V in V 



(2) 



c 



\pi(V) - P2 (V)\ < -7=Pi(V). 

VA 



We obtain that, for all V in V 



(2) 

,A,p* ' 



(1 - r,)K{P^P i]%) ) + (n - ^J 11 ) 



< (1 + ^(^15,^^)+ ( r 2 

8 Probabilistic Tools 



C(l + r 2 ) 
/A 



Pi 00 + 



ln(iY s 2 <5) 



71 



4 Inn + 



2r/p* 



Lemma 8.1. Let $x$ in $X(S)$, let $V$ be a finite subset of $S$ and let $Q$, $R$ be two probability
measures on $X(V)$ such that $R(x(V/\{i\})) > 0$. We have

$$Q_{i|V}(x) - R_{i|V}(x) = \frac{Q(x(V)) - R(x(V)) + Q_{i|V}(x)\left(R(x(V/\{i\})) - Q(x(V/\{i\}))\right)}{R(x(V/\{i\}))}.$$

The lemma immediately follows from the fact that $Q_{i|V}(x)Q(x(V/\{i\})) = Q(x(V))$ and $R_{i|V}(x) =
R(x(V))/R(x(V/\{i\}))$.
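
The identity of Lemma 8.1 can be checked numerically on a small example. The following is a minimal sketch, assuming a binary alphabet, three sites and strictly positive random laws; the helper names (`marginal`, `random_law`, `lemma_8_1_gap`) are hypothetical and only serve the illustration.

```python
import itertools
import random

def marginal(prob, sites):
    """Marginal of a joint law (dict: configuration tuple -> mass) on `sites`."""
    out = {}
    for config, mass in prob.items():
        key = tuple(config[j] for j in sites)
        out[key] = out.get(key, 0.0) + mass
    return out

def random_law(n_sites, alphabet, rng):
    """A strictly positive random law on alphabet**n_sites."""
    configs = list(itertools.product(alphabet, repeat=n_sites))
    weights = [rng.random() + 0.1 for _ in configs]
    total = sum(weights)
    return {c: w / total for c, w in zip(configs, weights)}

def lemma_8_1_gap(Q, R, V, i):
    """Largest |LHS - RHS| of the identity over all configurations on V."""
    rest = [j for j in V if j != i]
    QV, RV = marginal(Q, V), marginal(R, V)
    Qr, Rr = marginal(Q, rest), marginal(R, rest)
    worst = 0.0
    for xV in QV:
        x_rest = tuple(c for j, c in zip(V, xV) if j != i)
        q_cond = QV[xV] / Qr[x_rest]   # Q_{i|V}(x)
        r_cond = RV[xV] / Rr[x_rest]   # R_{i|V}(x)
        rhs = (QV[xV] - RV[xV] + q_cond * (Rr[x_rest] - Qr[x_rest])) / Rr[x_rest]
        worst = max(worst, abs((q_cond - r_cond) - rhs))
    return worst
```

For instance, with `rng = random.Random(0)`, two laws drawn by `random_law(3, (0, 1), rng)` and `V = (0, 1)`, `i = 0`, the returned gap is of the order of floating-point rounding.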

We recall the bound given by Bousquet [Bou02] for the deviation of the supremum of the
empirical process.

Theorem 8.2. Let $X_1,\ldots,X_n$ be i.i.d. random variables valued in a measurable space $(A,\mathcal{X})$.
Let $\mathcal{F}$ be a class of real valued functions, defined on $A$ and bounded by $b$. Let $v^2 = \sup_{f\in\mathcal{F}} P[(f - Pf)^2]$
and $Z = \sup_{f\in\mathcal{F}}(P_n - P)f$. Then, for all $x > 0$,

$$P\left(Z > E(Z) + \sqrt{\frac{2}{n}\left(v^2 + 2bE(Z)\right)x} + \frac{bx}{3n}\right) \le e^{-x}. \qquad (24)$$

Bousquet's result is a generalization of the elementary Bennett's inequality.

Theorem 8.3. Let $X_1,\ldots,X_n$ be i.i.d. random variables, real valued and bounded by $b$. Let
$v^2 = \mathrm{Var}(X_1)$ and $\bar{X}_n = n^{-1}\sum_{i=1}^n (X_i - E(X_i))$. Then, for all $x > 0$,

$$P\left(\bar{X}_n > \sqrt{\frac{2v^2x}{n}} + \frac{bx}{3n}\right) \le e^{-x}. \qquad (25)$$
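
As a numerical illustration of (25), the deviation radius can be compared with an exact binomial tail. This is a minimal sketch, assuming Bernoulli(p) variables (so that b = 1 and v^2 = p(1 - p)); the function names are chosen for the example only.

```python
import math

def deviation_radius(v2: float, b: float, n: int, x: float) -> float:
    """Right-hand side of (25): sqrt(2 v^2 x / n) + b x / (3 n)."""
    return math.sqrt(2.0 * v2 * x / n) + b * x / (3.0 * n)

def bernoulli_upper_tail(p: float, n: int, t: float) -> float:
    """Exact P(mean of n Bernoulli(p) draws - p > t), via the binomial law."""
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n + 1)
        if k / n - p > t
    )
```

With p = 0.3 and n = 50, the exact tail at the radius `deviation_radius(p * (1 - p), 1.0, n, x)` stays below `math.exp(-x)` for the values of x tried in the checks.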

Let us recall some well known tools of empirical processes theory. 

Definition 8.4. The covering number $N(\epsilon, T, d)$ is the minimal number of balls of radius $\epsilon$ with
centers in $T$ needed to cover $T$. The entropy is the log of the covering number, $H(\epsilon, T, d) =
\log(N(\epsilon, T, d))$.

Definition 8.5. An $\epsilon$-separated subset of $T$ is a subset $\{t_k\}$ of elements of $T$ whose pairwise
distance is strictly larger than $\epsilon$. The packing number $M(\epsilon, T, d)$ is the maximum size of an
$\epsilon$-separated subset of $T$.

These quantities are related by the following well-known lemma.

Lemma 8.6. (Kolmogorov and Tikhomirov [KT61]) Let $(T, d)$ be a metric space and let $\epsilon > 0$. Then

$$N(\epsilon, T, d) \le M(\epsilon, T, d) \le N(\epsilon/2, T, d).$$
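
On a small finite metric space, both numbers can be computed by exhaustive search, which gives a direct check of Lemma 8.6. The sketch below assumes a finite set of real points with the distance |p - q| and closed balls; the brute-force search is exponential and is only meant for tiny examples.

```python
import itertools

def covering_number(points, eps):
    """Smallest number of closed eps-balls centred in `points` covering `points`."""
    for size in range(1, len(points) + 1):
        for centers in itertools.combinations(points, size):
            if all(any(abs(p - c) <= eps for c in centers) for p in points):
                return size
    return len(points)

def packing_number(points, eps):
    """Largest subset of `points` with pairwise distances strictly above eps."""
    for size in range(len(points), 0, -1):
        for subset in itertools.combinations(points, size):
            if all(abs(p - q) > eps for p, q in itertools.combinations(subset, 2)):
                return size
    return 0
```

For any point set and any eps > 0, `covering_number(points, eps) <= packing_number(points, eps) <= covering_number(points, eps / 2)` should hold, as stated in the lemma.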






8.1 Concentration for Slope with quadratic risk 

The aim of this section is to prove the following result. 

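Theorem 8.7. Let $(S, A, P)$ be a random field and let $V$ be a subset in $\mathcal{V}_s$. Let $X'(V) =
\{x \in X(V),\ P(x(V)) \ne 0\}$ and let $p_-^V = \inf_{x\in X'(V)} P(x(V))$.
Let $Z = \sup_{x\in X'(V)} \frac{|\widehat{P}(x(V)) - P(x(V))|}{P(x(V))}$. For all $\delta > 1$, with probability larger than $1 - \delta^{-1}$,

$$Z \le \frac{64\sqrt{2}}{\sqrt{n p_-^V}}\sqrt{\ln\left(\frac{16}{p_-^V}\right)} + \frac{2048}{n p_-^V}\ln\left(\frac{16}{p_-^V}\right) + \sqrt{\frac{2\ln(\delta)}{n p_-^V}} + \frac{2\ln(\delta)}{n p_-^V}.$$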



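Let us state an important consequence of Theorem 8.7.

Lemma 8.8. Assume that $\inf_{V\in\mathcal{V}_s} p_-^V \ge \epsilon^{-2}n^{-1}\ln(nN_s\delta)$. There exists an absolute constant $C$
such that, with probability larger than $1 - \delta^{-1}$, for all $V$ in $\mathcal{V}_s$,

$$\left|\left\|\widehat{P}_{i|V} - P_{i|V}\right\|_{\widehat{P}}^2 - \left\|\widehat{P}_{i|V} - P_{i|V}\right\|_P^2\right| \le C\epsilon\left\|\widehat{P}_{i|V} - P_{i|V}\right\|_P^2.$$

Proof: Let $V$ in $\mathcal{V}_s$ and let $X'(V) = \{x \in X(V),\ P(x(V/\{i\})) \ne 0\}$. We have

$$\left|\left\|\widehat{P}_{i|V} - P_{i|V}\right\|_{\widehat{P}}^2 - \left\|\widehat{P}_{i|V} - P_{i|V}\right\|_P^2\right| \le \frac{1}{a}\sum_{x\in X'(V)}\left|\widehat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right|\left(\widehat{P}_{i|V}(x) - P_{i|V}(x)\right)^2$$
$$\le \sup_{x\in X'(V)}\frac{\left|\widehat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right|}{P(x(V/\{i\}))}\left\|\widehat{P}_{i|V} - P_{i|V}\right\|_P^2.$$

We take a union bound in Theorem 8.7 and we obtain, since $\inf_{V\in\mathcal{V}_s} p_-^V \ge \epsilon^{-2}n^{-1}\ln(nN_s\delta)$,
that there exists an absolute constant $C$ such that

$$\forall V \in \mathcal{V}_s,\quad \sup_{x\in X'(V)}\frac{\left|\widehat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right|}{P(x(V/\{i\}))} \le C\epsilon.$$

In the remainder of this section, we prove Theorem 8.7.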



Proposition 8.9. Let $P$ be a probability measure on $X(S)$ and let $V$ be a finite subset of $S$.
Let $X'(V) = \{x \in X(V),\ P(x(V)) \ne 0\}$ and let $p_-^V = \inf_{x\in X'(V)} P(x(V))$.
Let $Z = \sup_{x\in X'(V)} \frac{|\widehat{P}(x(V)) - P(x(V))|}{P(x(V))}$. For all $\delta > 1$, with probability larger than $1 - \delta^{-1}$,

$$Z \le 2E(Z) + \sqrt{\frac{2\ln(\delta)}{n p_-^V}} + \frac{2\ln(\delta)}{n p_-^V}.$$

Proposition 8.9 is a straightforward consequence of Bousquet's version of Talagrand's inequality,
that we apply to the class of functions $\mathcal{F} = \{(P(x(V)))^{-1}\mathbf{1}_{x(V)},\ x \in X'(V)\}$.
The second proposition lets us compute this expectation.

Proposition 8.10. Let $P$ be a probability measure on $X(S)$ and let $V$ be a finite subset of $S$.
Let $X'(V) = \{x \in X(V),\ P(x(V)) \ne 0\}$ and let $p_-^V = \inf_{x\in X'(V)} P(x(V))$. Then

$$E\left(\sup_{x\in X'(V)}\frac{\left|\widehat{P}(x(V)) - P(x(V))\right|}{P(x(V))}\right) \le \frac{32\sqrt{2}}{\sqrt{n p_-^V}}\sqrt{\ln\left(\frac{16}{p_-^V}\right)} + \frac{1024}{n p_-^V}\ln\left(\frac{16}{p_-^V}\right).$$

Proposition 8.10 was proved in [LT11]. We recall the proof here for the sake of completeness.
Let $(A_i)_{i\in I}$ be a collection of sets such that, for all $i \ne j \in I$, $A_i \cap A_j = \emptyset$ and let $(\alpha_i)_{i\in I}$ be
a collection of positive real numbers. Let $Z_I = \sup_{t\in\mathcal{F}_I} |(P_n - P)t|$, where $\mathcal{F}_I = \{t_i = \alpha_i\mathbf{1}_{A_i},\ i \in I\}$.
Here and in the rest of the proof, $\alpha^* = \sup_{i\in I}\alpha_i$ and $p^* = \sup_{i\in I}\alpha_i^2 P(A_i)$. The following result
can be derived from classical chaining arguments (see for example [Bou02]).



Lemma 8.11. Let J- be a class of functions, then 



E ( sup \{P n -P)t \) < ^5e 



' rD n /2 \ 

J # 1/2 (u,P,d 2 , P Jduj 



where the distance $d_{2,P_n}(t,t') = \sqrt{P_n[(t - t')^2]}$ and the diameter $D_n = \sqrt{\sup_{t\in\mathcal{F}} P_n(t^2)}$.

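In order to apply this result to $\mathcal{F} = \mathcal{F}_I$, we compute the entropy of $\mathcal{F}_I$. For all $i \ne j$, since
$A_i \cap A_j = \emptyset$,

$$(t_i - t_j)^2 = \left(\alpha_i\mathbf{1}_{A_i} - \alpha_j\mathbf{1}_{A_j}\right)^2 = \alpha_i^2\mathbf{1}_{A_i} + \alpha_j^2\mathbf{1}_{A_j}.$$

Hence $d_{2,P_n}(t_i,t_j) = \sqrt{\alpha_i^2P_n(A_i) + \alpha_j^2P_n(A_j)}$.

Consider an $\epsilon$-separated set $T_\epsilon = \{t_{i_1},\ldots,t_{i_N}\}$ in $(\mathcal{F}_I, d_{2,P_n})$. It comes from the previous
computation that, for all $k \ne k'$,

$$\alpha_{i_k}^2P_n(A_{i_k}) + \alpha_{i_{k'}}^2P_n(A_{i_{k'}}) > \epsilon^2.$$

Hence, there are at least $N - 1$ indexes $k \in \{1,\ldots,N\}$ such that

$$\alpha_{i_k}^2P_n(A_{i_k}) > \frac{\epsilon^2}{2}.$$

It follows that

$$1 \ge \sum_{i\in I}P_n(A_i) \ge \sum_{k=1}^N P_n(A_{i_k}) \ge \frac{(N-1)\epsilon^2}{2(\alpha^*)^2}.$$

Hence $N \le 1 + 2(\alpha^*)^2\epsilon^{-2}$, thus

$$H(\epsilon, \mathcal{F}_I, d_{2,P_n}) \le \log\left(1 + 2(\alpha^*)^2\epsilon^{-2}\right).$$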



We deduce from this inequality and Lemma 8.11| that 

E ^sup |(P n - P)i|^ < ^2e 0og(l + 2(a*) 2 e- 2 )de^ 

32 / rv 7 ^/ 2 , \ 

< -j=& I y ^log (2a*e- 1 )de j , 

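where $\widehat{p}_n^* = \sup_{i\in I}\alpha_i^2P_n(A_i)$. Now, let us recall the following elementary lemma.

Lemma 8.12. For all positive $K$, $A$ such that $K/A \ge e$, we have

$$\int_0^A\sqrt{\log(Kx^{-1})}\,dx \le 2A\sqrt{\log\left(\frac{K}{A}\right)}.$$

Actually,

$$\int_0^A\sqrt{\log(Kx^{-1})}\,dx = K\int_{K/A}^{\infty}\frac{\sqrt{\log u}}{u^2}\,du = A\sqrt{\log\left(\frac{K}{A}\right)} + \frac{K}{2}\int_{K/A}^{\infty}\frac{du}{u^2\sqrt{\log u}}.$$

Since $K/A \ge e$, $u^{-2}(\log u)^{-1/2} \le u^{-2}(\log u)^{1/2}$ on $[K/A,\infty[$, so that the last term is at most half
of the integral on the left-hand side. The result follows.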
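
Lemma 8.12 can be checked numerically. The following is a minimal sketch using a midpoint rule (an assumption of this example, chosen because the integrand is unbounded but integrable at 0); the function names are hypothetical.

```python
import math

def entropy_integral(K: float, A: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of sqrt(log(K/x)) over (0, A]."""
    h = A / steps
    return h * sum(math.sqrt(math.log(K / ((j + 0.5) * h))) for j in range(steps))

def lemma_8_12_bound(K: float, A: float) -> float:
    """Right-hand side 2 A sqrt(log(K/A)); the lemma requires K / A >= e."""
    return 2.0 * A * math.sqrt(math.log(K / A))
```

In the boundary case K = e, A = 1, the bound equals 2 while the integral is strictly smaller, so the factor 2 is not attained there.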

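By definition, $\widehat{p}_n^* \le (\alpha^*)^2$, hence $2\alpha^*/(\sqrt{\widehat{p}_n^*}/2) \ge 4 \ge e$. We deduce from Lemma 8.12 that

$$E\left(\sup_{t\in\mathcal{F}_I}|(P_n - P)t|\right) \le \frac{32}{\sqrt{n}}E\left[\sqrt{\widehat{p}_n^*}\sqrt{\log\left(\frac{4\alpha^*}{\sqrt{\widehat{p}_n^*}}\right)}\right].$$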

Let us now give another simple lemma. 



Lemma 8.13. The function $f : x \mapsto x\sqrt{\log(K/x)}$, defined on $(0, K)$, is positive, non-decreasing
on $(0, K/e^{1/2})$ and strictly concave.

The proof of the lemma is straightforward from the computations

$$f'(x) = \sqrt{\log(K/x)} - \frac{1}{2\sqrt{\log(K/x)}},\qquad f''(x) = -\frac{1}{2x\sqrt{\log(K/x)}} - \frac{1}{4x\left(\sqrt{\log(K/x)}\right)^3}.$$
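
The three properties of Lemma 8.13 can be verified on a grid. This is a minimal sketch with K = 4 chosen arbitrarily; the closed forms for the derivatives are those obtained by differentiating f directly, and the function names are only illustrative.

```python
import math

def f(x: float, K: float) -> float:
    """f(x) = x * sqrt(log(K / x)) on (0, K)."""
    return x * math.sqrt(math.log(K / x))

def f_prime(x: float, K: float) -> float:
    """First derivative: sqrt(L) - 1 / (2 sqrt(L)) with L = log(K / x)."""
    root = math.sqrt(math.log(K / x))
    return root - 1.0 / (2.0 * root)

def f_second(x: float, K: float) -> float:
    """Second derivative: -1/(2 x sqrt(L)) - 1/(4 x L**1.5), negative on (0, K)."""
    L = math.log(K / x)
    return -1.0 / (2.0 * x * math.sqrt(L)) - 1.0 / (4.0 * x * L**1.5)
```

The derivative vanishes exactly where log(K/x) = 1/2, that is at x = K/e^{1/2}, which is why monotonicity is only claimed up to that point.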



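It comes from Lemma 8.13 and Jensen's inequality that

$$E\left(\sup_{t\in\mathcal{F}_I}|(P_n - P)t|\right) \le \frac{32}{\sqrt{n}}E\left[\sqrt{\widehat{p}_n^*}\right]\sqrt{\log\left(\frac{4\alpha^*}{E\left(\sqrt{\widehat{p}_n^*}\right)}\right)}.$$

Now it comes from Jensen's inequality that

$$E\left[\sqrt{\widehat{p}_n^*}\right] \le \sqrt{E[\widehat{p}_n^*]} \le \sqrt{p^* + \alpha^*E\left(\sup_{t\in\mathcal{F}_I}|(P_n - P)t|\right)}.$$

It is clear from its definition that $p^* \le (\alpha^*)^2$. Moreover, as $P_n$ and $P$ are probability measures,
we have, for all $t$ in $\mathcal{F}_I$, $|(P_n - P)t| \le 2\alpha^*$. Hence, $\sqrt{\alpha^*E\left(\sup_{t\in\mathcal{F}_I}|(P_n - P)t|\right)} \le \sqrt{2}\alpha^*$. We
deduce from these inequalities that

$$\sqrt{p^* + \alpha^*E\left(\sup_{t\in\mathcal{F}_I}|(P_n - P)t|\right)} \le (1 + \sqrt{2})\alpha^* \le (4\alpha^*)/e^{1/2}.$$

Hence, it comes from Lemma 8.13 that, if $E = E\left(\sup_{t\in\mathcal{F}_I}|(P_n - P)t|\right)$,

$$E \le \frac{32}{\sqrt{n}}\sqrt{p^* + \alpha^*E}\sqrt{\log\left(\frac{4\alpha^*}{\sqrt{p^* + \alpha^*E}}\right)} \le \frac{32}{\sqrt{n}}\left(\sqrt{p^*} + \sqrt{\alpha^*E}\right)\sqrt{\log\left(\frac{4\alpha^*}{\sqrt{p^*}}\right)}.$$

It is then straightforward that

$$E\left(\sup_{t\in\mathcal{F}_I}|(P_n - P)t|\right) \le \frac{64}{\sqrt{n}}\sqrt{p^*\log\left(\frac{4\alpha^*}{\sqrt{p^*}}\right)} + \frac{2048}{n}\alpha^*\log\left(\frac{4\alpha^*}{\sqrt{p^*}}\right). \qquad (26)$$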

In order to conclude the proof of Proposition 8.10, for all $x \in X'(V)$, let $A_x = x(V)$ and $\alpha_x =
[P(x(V))]^{-1}$. We have

$$\alpha^* = \sup_{x\in X'(V)}[P(x(V))]^{-1} = \frac{1}{p_-^V},\qquad p^* = \sup_{x\in X'(V)}[P(x(V))]^{-2}P(x(V)) \le \frac{1}{p_-^V}.$$

Therefore, the proposition is straightforward from inequality (26).



8.2 Concentration of the variance term in quadratic risk 

The aim of this section is to prove the following concentration result, which is at the center of
the main proofs.

Theorem 8.14. Let $V$ be a finite subset of $S$. Let $b > 0$ and let $X_b(V) = \{x \in X(V),\ P(x(V)) \ge b^{-2}\}$.
For all $x > 0$, $\eta > 0$, we have

$$P\left(\sum_{x\in X_b(V)}\frac{\left(\widehat{P}(x(V)) - P(x(V))\right)^2}{P(x(V))} \le (1+\eta)^3\frac{a^{v}}{n} + \frac{4x}{\eta n} + \frac{32b^2x^2}{\eta^3n^2}\right) \ge 1 - e^{-x}.$$

Proof: Let us first recall the following consequence of the Cauchy-Schwarz inequality.

Lemma 8.15. Let $I$ be a finite set and let $(b_i)_{i\in I}$ be a collection of real numbers. We have

$$\sqrt{\sum_{i\in I}b_i^2} = \sup_{(a_i)_{i\in I},\ \sum_{i\in I}a_i^2\le 1}\ \sum_{i\in I}a_ib_i.$$

Proof: The lemma is obviously satisfied if all the $b_i = 0$. Assume now that it is not the case.
By the Cauchy-Schwarz inequality, we have, for all collections $(a_i)_{i\in I}$ such that $\sum_{i\in I}a_i^2 \le 1$,

$$\sum_{i\in I}a_ib_i \le \sqrt{\sum_{i\in I}a_i^2}\sqrt{\sum_{i\in I}b_i^2} \le \sqrt{\sum_{i\in I}b_i^2}.$$

Moreover, consider for all $i$ in $I$, $a_i = b_i/\sqrt{\sum_{i\in I}b_i^2}$. We have $\sum_{i\in I}a_i^2 = 1$ and $\sum_{i\in I}a_ib_i =
\sqrt{\sum_{i\in I}b_i^2}$, which concludes the proof.
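
The variational formula of Lemma 8.15 is easy to illustrate numerically. The sketch below builds the maximising vector used in the proof and compares it with random competitors in the unit ball; the helper names are hypothetical.

```python
import math
import random

def euclidean_norm(b):
    """sqrt(sum of b_i^2)."""
    return math.sqrt(sum(bi * bi for bi in b))

def maximiser(b):
    """The unit vector a_i = b_i / ||b|| attaining the supremum (b not all zero)."""
    norm = euclidean_norm(b)
    return [bi / norm for bi in b]

def inner(a, b):
    """sum of a_i * b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))
```

With `b = [3.0, -4.0, 12.0]`, the norm is 13 and `inner(maximiser(b), b)` recovers it up to rounding, while any other vector of the unit ball gives a smaller value.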
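Let us now introduce the following set.

$$B_V^b = \left\{f : X_b(V) \to \mathbb{R}\ \text{such that}\ f = \sum_{x\in X_b(V)}\frac{a_x}{\sqrt{P(x(V))}}\mathbf{1}_{x(V)},\ \text{where}\ \sum_{x\in X_b(V)}a_x^2 \le 1\right\}.$$

Let $P$ and $P_n$ be the following operators, defined for all functions $f$ by $P_nf = \frac{1}{n}\sum_{i=1}^nf(X_i)$
and, for all functions $f$ in $L^1(P)$, by $Pf = \int f(x)\,dP(x)$. Using Lemma 8.15 with $I = X_b(V)$
and

$$b_x = \frac{\widehat{P}(x(V)) - P(x(V))}{\sqrt{P(x(V))}},$$

we obtain

$$\sqrt{\sum_{x\in X_b(V)}\frac{\left(\widehat{P}(x(V)) - P(x(V))\right)^2}{P(x(V))}} = \sup_{f\in B_V^b}(P_n - P)f.$$

The functions $f$ in $B_V^b$ satisfy

$$\mathrm{Var}(f(X)) \le Pf^2 = \sum_{x\in X_b(V)}\frac{a_x^2}{P(x(V))}P(x(V)) \le 1,\qquad \|f\|_\infty \le \sup_{x\in X_b(V)}\frac{|a_x|}{\sqrt{P(x(V))}} \le b.$$

From Theorem 8.2, we have then, for all $\eta > 0$, for all $x > 0$,

$$P\left(\sup_{f\in B_V^b}(P_n - P)f > (1+\eta)E\left(\sup_{f\in B_V^b}(P_n - P)f\right) + \sqrt{\frac{2x}{n}} + \left(\frac{1}{3} + \frac{1}{\eta}\right)\frac{bx}{n}\right) \le e^{-x}.$$

From the Cauchy-Schwarz inequality, we have then

$$E\left(\sup_{f\in B_V^b}(P_n - P)f\right) \le \sqrt{\sum_{x\in X(V),\ P(x(V))\ne 0}\frac{E\left[\left(\widehat{P}(x(V)) - P(x(V))\right)^2\right]}{P(x(V))}} = \sqrt{\sum_{x\in X(V),\ P(x(V))\ne 0}\frac{\mathrm{Var}(\mathbf{1}_{x(V)})}{nP(x(V))}} \le \sqrt{\frac{a^{v}}{n}}.$$

We have obtained that

$$P\left(\sup_{f\in B_V^b}(P_n - P)f > (1+\eta)\sqrt{\frac{a^{v}}{n}} + \sqrt{\frac{2x}{n}} + \left(\frac{1}{3} + \frac{1}{\eta}\right)\frac{bx}{n}\right) \le e^{-x}.$$

Since $B_V^b$ is symmetric, $\sup_{f\in B_V^b}(P_n - P)f \ge 0$. We can therefore take the square in the previous
inequality to conclude the proof of the theorem.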



8.3 Concentration of the remainder term in the quadratic case 

Let us now give some important concentration inequalities. 

Lemma 8.16. Let V , V' be two subsets in V s . For all 5 > 0, we have, with probability larger 
than 1—5, 

- £ (P(x({V U V')/{i})) - P(x((V u v')/{i}))) (P% v (x) - pfak 



x£X{(VUV')) 



-"IP P II d2 H5 K H5) 
- -\P i \v-Pi\v\\ P ^2— + —. 



(27) 



- Yl U - P«VUV')) (P ilv ,(x) - P AV {x)) 



xex(yuv) 



■m lv -p, lv .\\ P ^ + ^. (28) 
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Proof of Lemma 8.16: Let f\ be the real valued function defined on X{{V U V')/{i}) by 



? 1= E [ p i\v( x ) - P i\V'i x )) 
xex((vuv')) 

h= Yl ( p i\v( x ) ~ p i\v{x)) l*(vuv)- 
xex((vuv')) 

h= Yj l <(VUV')/{i\)) Y ( P i\v( x b) ~ P i\V'( x b) 

x&X{{VUV')/{i}) b&A 



We have 



fx is upper bounded by max ffie *((VuV')/{0) T<beA [ P i\v( x b) ~ P i\y( x b)) - a - h is u PP er 
bounded by m.ax. xeX (y u yi\ \Pi\v ( x ) ~ P i\v( x ) ■ Since, for all x 7^ x' in X((V U V')/{i}), 

1 x((VUV")/{i}) 1 i'((VUl")/{»}) = °> We naVe 

Var(/ 1 (X))= Yl P(x({VUV')/{i}))) ifyfo) - P ?\v>( x b)) 

x&X((VUV')/{i}) \b£A J 

< a p ^ v u ^o/w))) E ( p t^) - ^iv(^)) 2 

i6t((VUV')/{«}) f'e- 4 

< 4a ^ (P i|y (x) - P AV ,{x)f P{x(V U *"/{»})) 

211 1 1 2 

= 4a - -PilV'Hp ■ 

Since, for all x ^ x' in A?(V U V'), l^yuv'^'tyuv') = 0, we have 

Var(/ 2 (X)) < Y {Pi\v{ x )-Pi\v>{ x )) 2 P{VUV>)<a\\P Av -P i]v f p . 

x£X((VUV')) 



Inequality (27) is therefore a consequence of Bennett's inequality, see Theorem 8.3. We obtain
Inequality (28) with exactly the same arguments. □



Pythagoras relation 

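Let us give here the Pythagoras relation that we used several times.

Proposition 8.17. Let $(S, A, P)$ be a random field, let $i$ in $S$, let $V$ be a subset of $S$ and
let $f$ be a function defined on $X(V)$. Then, the following relations hold

$$\int_{x\in X(S)}f(x(V))P_{i|S}(x)\,dP(x(S/\{i\})) = \int_{x\in X(V)}f(x(V))P_{i|V}(x)\,dP(x(V/\{i\})) = \int_{x\in X(S)}f(x(V))P_{i|V}(x)\,dP(x(S/\{i\})).$$

In particular, we have

$$\left\|\widehat{P}_{i|V} - P_{i|S}\right\|_P^2 = \left\|\widehat{P}_{i|V} - P_{i|V}\right\|_P^2 + \left\|P_{i|V} - P_{i|S}\right\|_P^2,$$

$$\left\|P_{i|V} - P_{i|S}\right\|_P^2 = \left\|P_{i|S}\right\|_P^2 - \left\|P_{i|V}\right\|_P^2.$$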
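
The last relation of Proposition 8.17 can be checked on a toy random field. The following is a minimal sketch, assuming a binary alphabet, three sites, a strictly positive law and the norm $\|f\|_P^2 = a^{-1}\sum_{x\in X(S)}P(x(S/\{i\}))f(x)^2$; the constants `ALPHABET`, `SITES`, `I` and the helper names are choices made for this example only.

```python
import itertools
import random

ALPHABET = (0, 1)          # illustrative binary alphabet, a = 2
SITES = (0, 1, 2)          # illustrative finite site set S
I = 0                      # the site i whose specification is studied

def random_law(rng):
    """A strictly positive random law on the configurations of SITES."""
    configs = list(itertools.product(ALPHABET, repeat=len(SITES)))
    weights = [rng.random() + 0.1 for _ in configs]
    total = sum(weights)
    return {c: w / total for c, w in zip(configs, weights)}

def mass(P, x, sites):
    """P(x(sites)): marginal probability of the restriction of x to `sites`."""
    return sum(m for c, m in P.items() if all(c[j] == x[j] for j in sites))

def cond(P, x, V):
    """P_{i|V}(x) = P(x(V)) / P(x(V minus {i}))."""
    rest = [j for j in V if j != I]
    return mass(P, x, V) / mass(P, x, rest)

def inner(P, f, g):
    """<f, g>_P = (1/a) * sum over x in X(S) of P(x(S minus {i})) f(x) g(x)."""
    rest = [j for j in SITES if j != I]
    return sum(mass(P, x, rest) * f(x) * g(x) for x in P) / len(ALPHABET)

def pythagoras_gap(P, V):
    """| ||P_iV - P_iS||^2 - (||P_iS||^2 - ||P_iV||^2) | for the law P."""
    p_v = lambda x: cond(P, x, V)
    p_s = lambda x: cond(P, x, SITES)
    diff = lambda x: p_v(x) - p_s(x)
    return abs(inner(P, diff, diff) - (inner(P, p_s, p_s) - inner(P, p_v, p_v)))
```

For any subset V containing the site i, `pythagoras_gap(random_law(random.Random(1)), V)` is zero up to rounding, in agreement with the proposition.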



Proof: The first equality comes from the following computations. For all $x$ in $X(V)$ and $y$ in
$X(S/V)$, let $x(V)\oplus y(S/V)$ be the configuration on $X(S)$ such that $(x(V)\oplus y(S/V))(j) = x(j)$
for all $j$ in $V$ and $(x(V)\oplus y(S/V))(j) = y(j)$ for all $j$ in $S/V$. By definition of the conditional
probabilities $P_{i|V}(x)$, we have

$$\int_{x\in X(S)}f(x(V))P_{i|S}(x)\,dP(x(S/\{i\}))$$
$$= \int_{x\in X(V)}f(x(V))\,dP(x(V/\{i\}))\int_{y\in X(S/V)}P_{i|S}(x(V)\oplus y(S/V))\,dP(y(S/V)\,|\,x(V/\{i\}))$$
$$= \int_{x\in X(V)}f(x(V))P_{i|V}(x)\,dP(x(V/\{i\})).$$

The second equality is a straightforward consequence of the first one. For the third one, we
apply the second equality to $f(x(V)) = \widehat{P}_{i|V} - P_{i|V}$. We have

$$\int_{x\in X(S)}f(x(V))P_{i|S}(x)\,dP(x(S/\{i\})) = \int_{x\in X(S)}f(x(V))P_{i|V}(x)\,dP(x(S/\{i\})).$$

Thus,

$$\left\|\widehat{P}_{i|V} - P_{i|S}\right\|_P^2 = \left\|f(x(V)) + P_{i|V} - P_{i|S}\right\|_P^2 = \left\|f(x(V))\right\|_P^2 + \left\|P_{i|V} - P_{i|S}\right\|_P^2$$
$$+ \frac{2}{a}\left(\int_{x\in X(S)}f(x(V))P_{i|V}(x)\,dP(x(S/\{i\})) - \int_{x\in X(S)}f(x(V))P_{i|S}(x)\,dP(x(S/\{i\}))\right)$$
$$= \left\|\widehat{P}_{i|V} - P_{i|V}\right\|_P^2 + \left\|P_{i|V} - P_{i|S}\right\|_P^2.$$

For the last equality, we use the second one with $f(x(V)) = P_{i|V}(x)$. We have

$$\left\|P_{i|V} - P_{i|S}\right\|_P^2 = \left\|P_{i|V}\right\|_P^2 + \left\|P_{i|S}\right\|_P^2 - \frac{2}{a}\int_{x\in X(S)}f(x(V))P_{i|S}(x)\,dP(x(S/\{i\}))$$
$$= \left\|P_{i|V}\right\|_P^2 + \left\|P_{i|S}\right\|_P^2 - \frac{2}{a}\int_{x\in X(S)}f(x(V))P_{i|V}(x)\,dP(x(S/\{i\}))$$
$$= \left\|P_{i|V}\right\|_P^2 + \left\|P_{i|S}\right\|_P^2 - 2\left\|P_{i|V}\right\|_P^2 = \left\|P_{i|S}\right\|_P^2 - \left\|P_{i|V}\right\|_P^2.$$


8.4 Basic tools for Kullback Loss

Let s be an integer larger than e. Let $\mathcal{V}_s$ be the collection of subsets of V with cardinality
smaller than s. Let $N_s$ be the cardinality of $\mathcal{V}_s$. Let i be a site in V. Let us first give an
elementary lemma on Kullback losses. It is a slightly sharper version of Lemma 6.3 in [CT06b].

Lemma 8.18. Let $P$, $Q$ be two probability measures on a finite space $A$ such that, for all $a$ in
$A$, $|P(a) - Q(a)| \le \eta Q(a)$, with $\eta \le 1/3$. Then

1 7tA (P(q)-Q(a))2 ^ ^ p , f P(a)\ (I 5 V \ ^ (P(a) - Q(a)) 2 

Proof: Let us first prove the following inequality, which is valid for all $|x| \le \eta \le 1/3$.

$$x - x^2\left(\frac{1}{2} + \frac{\eta}{2}\right) \le \ln(1+x) \le x - x^2\left(\frac{1}{2} - \frac{\eta}{2}\right).$$
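
The two-sided logarithm bound used at this step can be checked on a grid. The sketch below assumes the bound reads $x - x^2(1/2 + \eta/2) \le \ln(1+x) \le x - x^2(1/2 - \eta/2)$ for $|x| \le \eta \le 1/3$; these constants are a reconstruction from the Taylor-expansion argument (remainder at most $x^2\eta/(3(1-\eta)) \le x^2\eta/2$) and the function name is hypothetical.

```python
import math

def log_bounds(x: float, eta: float):
    """Lower and upper bounds on ln(1 + x), intended for |x| <= eta <= 1/3."""
    lower = x - x * x * (0.5 + eta / 2.0)
    upper = x - x * x * (0.5 - eta / 2.0)
    return lower, upper
```

The lower bound is nearly tight at x = -1/3 with eta = 1/3, which suggests the restriction eta <= 1/3 cannot be relaxed much with these constants.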
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It comes from the Taylor expansion. 

x 2 ^ {-l) k+l x k 



ln(l + x)=x- — +Y, 



2 ' ^ k 

k>3 



2 1 
X X T] 



X — X 



k>0 



ln(l + x) = x 



x 2 v-^ (—l) k+1 x k x 2 x 2 n 

— + > '- > x ' 

^ h - 2 3 



2 ^ k 

k>3 



2 3(1-7?) 



fe>0 



7?) / ' 



We deduce from this inequality and the equality 

P(a)-Q(a) ^ . (P(o) - Q(a)) 2 



E^( 



0(a) 



E 



Q(a) 



that 



^P(a) In 



<E P W Q{a) -[-o--o)l^ 



Q(a) 



2 2)^ A Q(a) Q(a) 



E 



(P(a) - Q{a)) 2 fin 



Q(a) 



2 + 2 + 



P(a) 



Q(a) 



1 r\ 

2 ~ 2 



P(a)-Q(a) /l , 77 \ ^ P(a) (P(a) - Q(a)) 5 



0(a) 



+ 9 E 



2 27^Q(a) Q(a) 



E 



(P(a) - Q(a)) 2 / 1 n 



Q(a) 



2 2 



P(a) 



Q(a) 



1 n 

- + - 

2 2 



8.5 Basic Concentration Inequality 

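Let us now give an elementary concentration result derived from Bennett's inequality.

Lemma 8.19. Let $\delta > 1$. With probability larger than $1 - \delta^{-1}$, for all $(V, x) \in (\mathcal{V}_s \times X(S))$,
we have

$$\left|\widehat{P}(x(V)) - P(x(V))\right| \le \sqrt{\frac{2P(x(V))\ln(2a^sN_s\delta)}{n}} + \frac{\ln(2a^sN_s\delta)}{3n}.$$

Proof: Let $V$ in $\mathcal{V}_s$ and $x$ in $X(V)$. We have from Bennett's inequality, for all $t > 0$,

$$P\left(\left|\widehat{P}(x(V)) - P(x(V))\right| > \sqrt{\frac{2\mathrm{Var}(\mathbf{1}_{x(V)})t}{n}} + \frac{t}{3n}\right) \le 2e^{-t}.$$

We have $\mathrm{Var}(\mathbf{1}_{x(V)}) \le P(x(V))$. Hence, we conclude the proof with a union bound.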
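
The bookkeeping of the union bound can be made explicit. This is a minimal sketch, assuming at most $N_s a^s$ pairs $(V, x)$ and the level $t = \ln(2a^sN_s\delta)$; the function names and the numerical values used below are illustrative only.

```python
import math

def log_level(a: int, s: int, N_s: int, delta: float) -> float:
    """t = ln(2 a^s N_s delta), the level used in the union bound."""
    return math.log(2.0 * a**s * N_s * delta)

def deviation_radius(p: float, n: int, t: float) -> float:
    """sqrt(2 p t / n) + t / (3 n), the radius around P(x(V)) = p."""
    return math.sqrt(2.0 * p * t / n) + t / (3.0 * n)

def union_bound_failure(a: int, s: int, N_s: int, delta: float) -> float:
    """N_s * a^s events, each failing with probability at most 2 e^{-t}."""
    return N_s * a**s * 2.0 * math.exp(-log_level(a, s, N_s, delta))
```

With this choice of t, the total failure probability is exactly $1/\delta$. When $p \ge \Lambda t/n$ for a large constant $\Lambda$, the radius is at most $2p/\sqrt{\Lambda}$, which is the kind of typicality statement derived next.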
We deduce from Lemma 8.19 the following typicality results.

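Lemma 8.20. Let $\Lambda \ge 100$. Let $\Omega_{prob}(\delta)$ be the following event,

$$\left\{\forall (V, x) \in \mathcal{V}_s \times \mathcal{X}(S),\ \left|\hat{P}(x(V)) - P(x(V))\right| \le \sqrt{\frac{2P(x(V))\ln(2a^sN_s\delta)}{n}} + \frac{\ln(2a^sN_s\delta)}{3n}\right\}.$$

We have $\mathbb{P}(\Omega_{prob}(\delta)) \ge 1 - \delta^{-1}$ and, on $\Omega_{prob}(\delta)$, for all $V$ in $\mathcal{V}_s$ and all $x$ in $\mathcal{X}(S)$ such that

$$P(x(V)) \ge \Lambda\frac{\ln(2a^sN_s\delta)}{n},$$

we have

$$\left|\hat{P}(x(V)) - P(x(V))\right| \le 2\sqrt{\frac{P(x(V))\ln(2a^sN_s\delta)}{n}} \le \frac{2P(x(V))}{\sqrt{\Lambda}},$$

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le \frac{4}{1-2\Lambda^{-1/2}}\sqrt{\frac{\ln(2a^sN_s\delta)}{nP(x(V))}}\,P_{i|V}(x) \le \frac{4}{\sqrt{\Lambda}-2}\,P_{i|V}(x).$$

On $\Omega_{prob}(\delta)$, for all $V$ in $\mathcal{V}_s$ and all $x$ in $\mathcal{X}(S)$ such that

$$\hat{P}(x(V)) \ge \Lambda\frac{\ln(2a^sN_s\delta)}{n},$$

we have

$$\left|\hat{P}(x(V)) - P(x(V))\right| \le 2\sqrt{\frac{\hat{P}(x(V))\ln(2a^sN_s\delta)}{n}} \le \frac{2\hat{P}(x(V))}{\sqrt{\Lambda}},$$

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le 2(1+\sqrt{2})\sqrt{\frac{\ln(2a^sN_s\delta)}{n\hat{P}(x(V))}}\,\hat{P}_{i|V}(x) \le \frac{2(1+\sqrt{2})}{\sqrt{\Lambda}}\,\hat{P}_{i|V}(x).$$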
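Proof: When $P(x(V)) \ge \Lambda\ln(2a^sN_s\delta)/n$, we have

$$\frac{\ln(2a^sN_s\delta)}{3n} \le \frac{1}{3\sqrt{2\Lambda}}\sqrt{\frac{2P(x(V))\ln(2a^sN_s\delta)}{n}}, \quad\text{and}\quad \sqrt{\frac{P(x(V))\ln(2a^sN_s\delta)}{n}} \le \frac{P(x(V))}{\sqrt{\Lambda}}.$$

This gives the first inequalities, as $\sqrt{2} + (3\sqrt{\Lambda})^{-1} < 2$. We also have, since $P(x(V/\{i\})) \ge P(x(V))$,

$$\left|\hat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right| \le 2\sqrt{\frac{P(x(V/\{i\}))\ln(2a^sN_s\delta)}{n}} \le \frac{2P(x(V/\{i\}))}{\sqrt{\Lambda}}.$$

From Lemma 8.1, we have

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le \frac{\left|\hat{P}(x(V)) - P(x(V))\right| + \hat{P}_{i|V}(x)\left|\hat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right|}{P(x(V/\{i\}))}.$$

Hence,

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le 2\sqrt{\frac{\ln(2a^sN_s\delta)}{nP(x(V/\{i\}))}}\left(\sqrt{P_{i|V}(x)} + \hat{P}_{i|V}(x)\right).$$

We just proved that

$$\sqrt{\frac{\ln(2a^sN_s\delta)}{nP(x(V/\{i\}))}} \le \Lambda^{-1/2}, \quad\text{hence}\quad \hat{P}_{i|V}(x) \le \frac{P_{i|V}(x) + 2\Lambda^{-1/2}\sqrt{P_{i|V}(x)}}{1-2\Lambda^{-1/2}} \le \frac{1+2\Lambda^{-1/2}}{1-2\Lambda^{-1/2}}\sqrt{P_{i|V}(x)}.$$

Therefore,

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le \frac{4}{1-2\Lambda^{-1/2}}\sqrt{\frac{\ln(2a^sN_s\delta)}{nP(x(V/\{i\}))}}\sqrt{P_{i|V}(x)} = \frac{4}{1-2\Lambda^{-1/2}}\sqrt{\frac{\ln(2a^sN_s\delta)}{nP(x(V))}}\,P_{i|V}(x) \le \frac{4}{\sqrt{\Lambda}-2}\,P_{i|V}(x).$$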
Let $u^2 = n^{-1}\ln(2a^sN_s\delta)$. On $\Omega_{prob}(\delta)$, we have 

p(x(v)) < p( S (i/))+2^ypRtoj+^ = [ypWY) + ^) 

Since P(x(V)) > Au 2 , we deduce that 

5 /6A- 1 



> 2 2 
1i \ U 

+ 12- 



p(x(y)) > wa 



i i 

12 ~ 



Since A > 2, we deduce that 



P(x(V)) - P(x(V)) 



< 



u = A + 



\ 



12 



u 2 . 



v / 2 + 



V 



X/PRV 7 )) < 2u^P{x{V)). 



From the same inequality, we also obtain 
P(x(V)) - P{x{V)) 



< 2P(x(F)) < 2P(x(V)) 



A 



l i 



A 



12 

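Since $\hat{P}(x(V/\{i\})) \ge \hat{P}(x(V))$, we prove with the same arguments that

$$\left|\hat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right| \le 2\sqrt{\frac{\hat{P}(x(V/\{i\}))\ln(2a^sN_s\delta)}{n}} \le \frac{2\hat{P}(x(V/\{i\}))}{\sqrt{\Lambda}}.$$

From Lemma 8.1, we have

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le \frac{\left|\hat{P}(x(V)) - P(x(V))\right| + P_{i|V}(x)\left|\hat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right|}{\hat{P}(x(V/\{i\}))}.$$

Hence,

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le 2\sqrt{\frac{\ln(2a^sN_s\delta)}{n\hat{P}(x(V/\{i\}))}}\left(\sqrt{\hat{P}_{i|V}(x)} + P_{i|V}(x)\right).$$

If $\sqrt{\hat{P}_{i|V}(x)} \ge P_{i|V}(x)$, we deduce that

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le 4\sqrt{\frac{\ln(2a^sN_s\delta)\hat{P}_{i|V}(x)}{n\hat{P}(x(V/\{i\}))}}.$$

Otherwise, we have

$$\left(1 - \frac{4}{\sqrt{\Lambda}}\right)P_{i|V}(x) \le \left(1 - 4\sqrt{\frac{\ln(2a^sN_s\delta)}{n\hat{P}(x(V/\{i\}))}}\right)P_{i|V}(x) \le \hat{P}_{i|V}(x).$$

Since $\Lambda \ge 100$, we obtain $P_{i|V}(x) \le 2\hat{P}_{i|V}(x)$. We deduce that

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le 2(1+\sqrt{2})\sqrt{\frac{\ln(2a^sN_s\delta)\hat{P}_{i|V}(x)}{n\hat{P}(x(V/\{i\}))}} = 2(1+\sqrt{2})\sqrt{\frac{\ln(2a^sN_s\delta)}{n\hat{P}(x(V))}}\,\hat{P}_{i|V}(x) \le \frac{2(1+\sqrt{2})}{\sqrt{\Lambda}}\,\hat{P}_{i|V}(x).$$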
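The proof above rests on two elementary facts that are easy to check numerically: the decomposition of the error on a ratio of probabilities, and the size of the numerical constants when $\Lambda \ge 100$. The sketch below is a hedged illustration only. It assumes that $P_{i|V}(x) = P(x(V))/P(x(V/\{i\}))$ and that Lemma 8.1 (not reproduced in this section) is the triangle-inequality bound written in the docstring; all names and numerical values are made up for the example.

```python
import math

def cond_prob_error_bound(p_hat_v: float, p_v: float,
                          p_hat_rest: float, p_rest: float) -> float:
    """Assumed form of the Lemma 8.1 bound:
    (|P^(x(V)) - P(x(V))| + P^_{i|V}(x) |P^(x(V/{i})) - P(x(V/{i}))|) / P(x(V/{i})),
    where P^ denotes the empirical probability."""
    cond_hat = p_hat_v / p_hat_rest
    return (abs(p_hat_v - p_v) + cond_hat * abs(p_hat_rest - p_rest)) / p_rest

def first_case_constants(lam: float):
    """Constants appearing when P(x(V)) >= Lambda * ln(2 a^s N_s delta) / n."""
    c_joint = math.sqrt(2) + 1 / (3 * math.sqrt(lam))  # must stay below 2
    c_cond = 4 / (math.sqrt(lam) - 2)                  # relative error on P_{i|V}
    return c_joint, c_cond

# Illustrative probabilities (assumptions, not data from the paper).
p_v, p_rest = 0.06, 0.20            # P(x(V)), P(x(V/{i}))
p_hat_v, p_hat_rest = 0.05, 0.22    # empirical counterparts

err = abs(p_hat_v / p_hat_rest - p_v / p_rest)
bound = cond_prob_error_bound(p_hat_v, p_v, p_hat_rest, p_rest)
c_joint, c_cond = first_case_constants(100.0)
```

With $\Lambda = 100$ the relative error on the conditional probability is at most $4/(\sqrt{\Lambda}-2) = 1/2$, and it decreases like $\Lambda^{-1/2}$ as the threshold grows.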
8.6 Control of the variance terms in Kullback loss 

The following lemma gives an important decomposition of the Kullback loss. 

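Lemma 8.21. Let $\Lambda \ge 100$ and let $\mathcal{V}_{s,\Lambda}$, $\widehat{\mathcal{V}}^{(2)}_{s,\Lambda}$ be respectively the collection of subsets $V$ in $\mathcal{V}_s$ such that, for all $x$ in $\mathcal{X}(V)$,

$$P(x(V)) = 0, \quad\text{or}\quad P(x(V)) \ge \Lambda\frac{\ln(2a^sN_s\delta)}{n},$$

and the collection of subsets $V$ in $\mathcal{V}_s$ such that, for all $x$ in $\mathcal{X}(V)$,

$$\hat{P}(x(V)) = 0, \quad\text{or}\quad \hat{P}(x(V)) \ge \Lambda\frac{\ln(2a^sN_s\delta)}{n}.$$

Let $\Omega_{prob}(\delta)$ be the event defined in Lemma 8.20. On $\Omega_{prob}(\delta)$, for all $V$ in $\mathcal{V}_{s,\Lambda}$, we have 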



20 



rf)<i E 



xex(y) 



P(x{V)) -P(x{V)) 
P«V)) 



+ 



14 



E 



*e#(V/{*}) 



P(x(V/{i})) - P(x(V/{i})) 
P{x{V/m 



^ v ^o. E — - + * E 



P(x(V/{i})) - P(x{V/{i})) 



xex(v) 



8.20 



Proof: From Lemma 
y/llh~ 1 P i \y{x). Hence, from Lemma 



P{x{V)) 3 ^ P(x(V/{i})) 

xex(v/{i}) 

(2) ^ 

for all V in V Sj a or V s ^, for all x in 1/, we have \Pi\y(x), P^y (»)| < 



8.18 



P1 (v)= e n*(v))in(p^l) 



t&X{V) 

1 

2 \ :] V 'A 



xeA'(v) 



1 

2 \ 



xS#(V) 



P%\vip) 



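From Lemma 8.1, we have

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le \frac{\left|\hat{P}(x(V)) - P(x(V))\right| + \hat{P}_{i|V}(x)\left|\hat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right|}{P(x(V/\{i\}))},$$

$$\left|\hat{P}_{i|V}(x) - P_{i|V}(x)\right| \le \frac{\left|\hat{P}(x(V)) - P(x(V))\right| + P_{i|V}(x)\left|\hat{P}(x(V/\{i\})) - P(x(V/\{i\}))\right|}{\hat{P}(x(V/\{i\}))}.$$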
The second inequality gives that $p_1(V)$ is smaller than 

/ svTI \ (p{x(V))-P{x{V))) 2 ^ (p{x{V/{i}))-P(x{V/{i}))) 2 

- { 1 + ^7x) x ^ v) p(x(vmm lv (x) + p ^ v{x) wmW) 

i + (p«v)) - P(x(v))) 2 / 5VIT \ (p(x(y/{z}))-P(x(y/W))) 2 

~l W P(*(V)) + { 1 + J7Xj P(x(V/{i})) 



We also have 

2 (p(s(V0) " ^(^(^))) 2 + Pi|v(s)Pi|v(x))P(x(y/{i})) - P{x(V/{i})f 



+ 



P(x(V/{i}))P(x(V/{i})) 
(P llv (x)+P l{v (x))) P(x(V/{i}))-P(x(V/{i})) P(x{V))-P{x{V)) 



P(x(V/{i}))P(x(V/{i})) 



3 (p(x(V)) - P{x{V))) 2 + 2(P i]v (x) + P AV {x)f [p(x(V/{i})) - P(x(V/{i}))^ 
~ 2P{x(V/{i}))P(x(V /{{})) 
Hence, 

2 



3 x _ { P(x{V))-P(x(V)) 



2 ^ P{x{V)) 

[pjxjv/m - p(x(v/{i}))) 2 { p tlv(x) + p t|y(x) )2 

is smaller than 

3 „ (fW)-f(iW))' / /TT\ (p(x(v/{i}))-p(x(v/{i}))y 

5 Jjo >^ + i 2+ V a Wi • 

8.7 Concentration for the slope heuristic in the Kullback case 

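Lemma 8.22. Let $\Lambda \ge 100$ and let $\mathcal{V}_{s,\Lambda}$ and $\widehat{\mathcal{V}}^{(2)}_{s,\Lambda}$ be respectively the collection of subsets $V$ in $\mathcal{V}_s$ such that, for all $x$ in $\mathcal{X}(V)$,

$$P(x(V)) = 0, \quad\text{or}\quad P(x(V)) \ge \Lambda\frac{\ln(2a^sN_s\delta)}{n},$$

and the collection of subsets $V$ in $\mathcal{V}_s$ such that, for all $x$ in $\mathcal{X}(V)$,

$$\hat{P}(x(V)) = 0, \quad\text{or}\quad \hat{P}(x(V)) \ge \Lambda\frac{\ln(2a^sN_s\delta)}{n}.$$

Let $\Omega_{prob}(\delta)$ be the event defined in Lemma 8.20. On $\Omega_{prob}(\delta)$, there exists an absolute constant $C > 0$ such that

$$\left|p_1(V) - p_2(V)\right| \le \frac{C}{\sqrt{\Lambda}}\,p_1(V).$$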
vA 
Proof: We use Lemmas 8.18 and 8.20. On $\Omega_{prob}(\delta)$, we have 

^2 



piW<2 i + — = 2^ p (x(Y/{t})) -r • 



<P2(^)= P(x(^))ln 



Pi\v(?) J 



xt=X(V) 



x£X{V) 



8.8 Concentration of L(V)-L(V) 

The following lemma lets us control the remainder term in the oracle inequality. 

Lemma 8.23. Let $\delta > 1$ and let $\mathcal{V}_{s,\Lambda,p_*}$ be the subset of $\mathcal{V}_{s,\Lambda}$ of the sets $V$ such that, for all $x$ in $\mathcal{X}(V)$, $P_{i|V}(x) = 0$ or $P_{i|V}(x) \ge p_*$. With probability at least $1 - \delta^{-1}$, for all $V, V'$ in $\mathcal{V}_{s,\Lambda,p_*}$, for all $\eta > 0$, we have, 



£ (P(x(V U V')) - P(x(V U V')) In ( 
x£X(VUV) v l \ v K ' 

< V (K(P ils ,P llv )+K(P t]s ,P llv ,)) + ln( ^ (41nn+ JL) . 

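Let $\widehat{\mathcal{V}}^{(2)}_{s,\Lambda,p_*}$ be the subset of $\widehat{\mathcal{V}}^{(2)}_{s,\Lambda}$ of the sets $V$ such that, for all $x$ in $\mathcal{X}(V)$, $\hat{P}_{i|V}(x) = 0$ or $\hat{P}_{i|V}(x) \ge p_*$. With probability at least $1 - \delta^{-1}$, for all $V, V'$ in $\widehat{\mathcal{V}}^{(2)}_{s,\Lambda,p_*}$, for all $\eta > 0$, we have, 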
£ (P(x(V U V')) — P(x(V U V')) In ( ) 



xex(vuv) 



< V {K{P AS ,P i{v ) + K(P i{s ,P i{v ,)) + ( 41nn+ ' 



Proof: Let us first write 



(P(*<y U V')) - P(x{V U V')) In ( ) 

x&X{VUV') ^ i\V'\ ) J 

< Yl (P(x(VUV'))-P(x(VUV'))(ln(^p^)-ln 

^v77^„m V V Pi\V'{X) ) V Pi\V{%) 



x&X(VUV) 
Let us now write $V_*$ for $V$ or $V'$. We have 

£ (P(s(y u O) - u O) ^ ( Pll ; uv ^ x) ) 

-<*.-*>( E ln(^f ) ) 1 ^ )| . 

\a;e*(VuV') v »|v»v ,/ / 

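The function $f : \mathcal{X}(V \cup V') \to \mathbb{R}$, $x \mapsto \ln\left(\frac{P_{i|V\cup V'}(x)}{P_{i|V_*}(x)}\right)$ is upper bounded on $\Omega_{prob}(\delta)$ by $2\ln n$. Since it is not random, the bound also holds on $\Omega_{prob}(\delta)^c$. Let us evaluate its variance.

$$\mathrm{Var}(f(X)) \le Pf^2 = \sum_{x\in\mathcal{X}(V\cup V')} P(x(V\cup V'))\left(\ln\left(\frac{P_{i|V\cup V'}(x)}{P_{i|V_*}(x)}\right)\right)^2.$$

Let us also recall here the following lemma (see [Mas07], Lemma 7.24 p. 275, or [BS91]).

Lemma 8.24. For all probability measures $P$ and $Q$, with $P \ll Q$,

$$\frac{1}{2}\int (dP \wedge dQ)\left(\ln\left(\frac{dP}{dQ}\right)\right)^2 \le K(P, Q) \le \frac{1}{2}\int (dP \vee dQ)\left(\ln\left(\frac{dP}{dQ}\right)\right)^2.$$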
Since P i |y )it (a;) > 2P i \ Vt (x)/3 > 2p*/3, we deduce that 

\ a r(f(X))<^-K(P i{vuv ,,P m ). 
Applying Bennett's inequality to $f$, we obtain that, with probability $1 - 2e^{-t}$, 



xeA'(yuy') 



U2 



We conclude the proof with a union bound and the classical inequality $2ab \le \eta a^2 + \eta^{-1} b^2$. 
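The two analytic tools used in this proof can be sanity-checked on finite alphabets. The sketch below is an illustration under stated assumptions: it draws random pairs of discrete distributions with full support and checks the two-sided bound of Lemma 8.24, $\frac{1}{2}\sum_a (P(a)\wedge Q(a))\ln^2(P(a)/Q(a)) \le K(P,Q) \le \frac{1}{2}\sum_a (P(a)\vee Q(a))\ln^2(P(a)/Q(a))$, and then checks the inequality $2ab \le \eta a^2 + \eta^{-1}b^2$ on a few values. The alphabet size, the random seed and the function names are arbitrary choices for the example.

```python
import math
import random

def kullback(p, q):
    """Kullback divergence K(P, Q) for distributions with full support."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q))

def squared_log_bounds(p, q):
    """Lower and upper bounds of Lemma 8.24 on a finite alphabet."""
    sq_logs = [math.log(pa / qa) ** 2 for pa, qa in zip(p, q)]
    lower = 0.5 * sum(min(pa, qa) * s for pa, qa, s in zip(p, q, sq_logs))
    upper = 0.5 * sum(max(pa, qa) * s for pa, qa, s in zip(p, q, sq_logs))
    return lower, upper

def random_distribution(rng, size):
    """Random probability vector with strictly positive entries."""
    weights = [rng.uniform(0.05, 1.0) for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the run is reproducible
checks = []
for _ in range(200):
    p = random_distribution(rng, 5)
    q = random_distribution(rng, 5)
    lower, upper = squared_log_bounds(p, q)
    checks.append(lower - 1e-12 <= kullback(p, q) <= upper + 1e-12)

# Classical inequality 2ab <= eta a^2 + b^2 / eta for eta > 0.
young_ok = all(2 * a * b <= eta * a * a + b * b / eta + 1e-12
               for a in (0.1, 1.0, 3.0)
               for b in (0.2, 2.0)
               for eta in (0.1, 1.0, 5.0))
```

The lower bound is what allows the variance of $f$ to be controlled by a Kullback divergence once the conditional probabilities are bounded away from zero.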